Machine Learning for Enhanced Identification Probability in RPLC/HRMS Nontargeted Workflows

Hiu-Lok Ngan, Viktoriia Turkina, Denice van Herwerden, Hong Yan, Zongwei Cai*, Saer Samanipour*

*Corresponding author for this work

Research output: Contribution to journal, Journal article, peer-review

Abstract

In HRMS-based nontargeted analysis (NTA), spectral matching is crucial for chemical identification, particularly in the absence of retention information. This study introduces the class probability of true positives (P(TP)) as an innovative approach, leveraging data from MS/MS spectra and calibrant-free predicted retention time indices (RTIs) through three machine learning (ML) models to enhance identification probability (IP). The first model is a molecular fingerprint (MF)-to-RTI model trained on 4713 calibrants. The second model, a cumulative neutral loss (CNL)-to-RTI model, used 485,577 experimental spectra. The final model, a binary classification model, was trained using 1,686,319 TP and semisynthetic true negative (TN) spectral matches. High correlations between MF-derived and CNL-derived RTI values (R² = 0.96 for training; 0.88 for testing) suggest reduced RTI errors in TP spectral matches. Incorporating reference spectral library searches and RTI errors, the k-nearest neighbors algorithm achieved a weighted F1 score of 0.65 and a Matthews correlation coefficient of 0.30 for pesticides at concentrations of 1 to 1000 ppb in blank samples, with a recall of 0.60 in black tea matrices. Compared with library matching alone, the average IPs for pesticides increased by 54.5, 52.1, and 46.7% when spiked in blank, 10× diluted, and 100× diluted tea matrices, respectively. This work demonstrates the effectiveness of ML in enhancing the chemical IPs of annotated compounds within complex matrices.
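
The idea of a k-nearest neighbors class probability can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `knn_tp_probability`, the two-feature layout (a library match score and an absolute difference between MF-derived and CNL-derived RTI, both scaled to [0, 1]), the choice of k = 3, and the hand-made training matches are all assumptions made for illustration only. P(TP) is taken here as the fraction of the k nearest labelled matches that are true positives.

```python
import math


def knn_tp_probability(query, training, k=3):
    """Return the fraction of the k nearest training matches labelled TP.

    query    : (match_score, rti_error) for one candidate spectral match
    training : list of ((match_score, rti_error), label), label 1 = TP, 0 = TN
    """
    # Rank labelled matches by Euclidean distance to the query point.
    ranked = sorted(training, key=lambda item: math.dist(query, item[0]))
    neighbours = ranked[:k]
    # Class probability of TP = share of TP labels among the neighbours.
    return sum(label for _, label in neighbours) / len(neighbours)


# Hypothetical, hand-made labelled matches (NOT data from the study).
# Features: (library match score, |RTI_MF - RTI_CNL|), both scaled to [0, 1].
TRAINING = [
    ((0.95, 0.05), 1),
    ((0.90, 0.10), 1),
    ((0.85, 0.08), 1),
    ((0.80, 0.15), 1),
    ((0.90, 0.70), 0),
    ((0.60, 0.60), 0),
    ((0.40, 0.80), 0),
    ((0.30, 0.50), 0),
]

# Two candidates with similar match scores but very different RTI errors.
p_consistent = knn_tp_probability((0.92, 0.07), TRAINING, k=3)
p_inconsistent = knn_tp_probability((0.88, 0.75), TRAINING, k=3)
```

In this toy setting, two candidates with nearly the same library match score receive different P(TP) values because their RTI errors place them among different neighbours, which is the intuition behind combining spectral matching with predicted retention information.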
Original language: English
Pages (from-to): 18028-18035
Number of pages: 8
Journal: Analytical Chemistry
Volume: 97
Issue number: 33
Early online date: 12 Aug 2025
Publication status: Published - 26 Aug 2025