Research on Novel Feature Detection Algorithms Applicable to Liquid Chromatography-Mass Spectrometry Data

Posted on: 2023-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: M Tian
Full Text: PDF
GTID: 2530307070974469
Subject: Analytical Chemistry
Abstract/Summary:
Cellular life activities are carried out jointly by numerous genes, proteins, and small-molecule metabolites. The metabolome lies downstream of the gene and protein regulation networks and provides the end-point information of biology. Metabolomics aims to provide an unbiased and comprehensive quantification of metabolites in organisms, tissues or cells. Untargeted metabolomics simultaneously detects as many metabolites as possible in samples and discovers metabolomic variations between groups. Currently, liquid chromatography-mass spectrometry (LC-MS) is the most widely used analytical technique for metabolomic analysis. LC-MS analysis often produces complex three-dimensional datasets, which makes it rather difficult to extract real features from them. The study of feature detection algorithms is therefore of particular importance. This thesis focuses on feature extraction algorithms for LC-MS datasets. The details are as follows:

(1) We used XCMS to perform feature detection, feature matching, retention time alignment and missing value filling on LC-MS datasets of Nicotiana tabacum L. (NTL) leaves. The results of t-distributed stochastic neighbour embedding (t-SNE) on the dataset clearly demonstrate the separation trend between different maturity grades of flue-cured NTL leaves. Discriminant models between the maturity grades were established using orthogonal partial least squares discriminant analysis (OPLS-DA). The quality metrics of the models are R²Y = 0.939 and Q² = 0.742 (unripe and ripe), R²Y = 0.900 and Q² = 0.847 (overripe and ripe), and R²Y = 0.972 and Q² = 0.930 (overripe and unripe). XCMS was also used to analyse the liquor and left-sided colon cancer (LCC) datasets, and an extreme gradient boosting (XGBoost) model was built on the extracted features. The results show that the model based on XCMS achieves good classification accuracy. XCMS was further employed to analyse the Arabidopsis thaliana dataset, and the resulting OPLS-DA model shows good predictive capability. This chapter is the cornerstone of the study of feature detection algorithms. It demonstrates the metabolomics analysis process based on XCMS feature extraction and serves as a baseline for comparison with the subsequent studies.

(2) Pure ion chromatograms (PICs) were extracted from the liquor dataset and the LCC dataset by the K-means-clustering-based Pure Ion Chromatogram extraction method version 2.0 (KPIC2). The fusion of uniform manifold approximation and projection (UMAP) and XGBoost allows for non-linear low-dimensional embedding of high-dimensional data and discriminative modelling. Results show that the features extracted by KPIC2 achieve 100% classification accuracy on the test sets of both the liquor and LCC datasets. Compared with the XCMS-based model, the XGBoost model built on KPIC2 features is more accurate and reasonable. KPIC2 was also used to process the Arabidopsis thaliana dataset, demonstrating the reliability of the model and the soundness of the KPIC2 framework. Integrating UMAP and XGBoost into the KPIC2 package extends its visualisation and modelling performance on complex datasets, so that it can not only process nonlinear datasets effectively but also greatly improve the accuracy of data analysis in untargeted metabolomics.

(3) We developed a deep learning-based pure ion chromatogram extraction method (DeepPIC) that can extract PICs directly from raw files. The method builds a U-Net network that learns rules for detecting PICs from paired input and output data. Its input and output are slices of the original LC-MS data and the corresponding PICs, respectively. The DeepPIC model was trained, validated and tested on an Arabidopsis thaliana dataset with 100 annotated PICs. Four different types of datasets were used to evaluate the performance of the method. Results show that the method is able to extract more realistic features on the MM48 dataset. Its recall, precision and F1-score on the simulated MM48 dataset are better than those of XCMS and Feature Finder Metabo, and the method is more robust across noise levels. The distribution of correlation coefficients between PIC and concentration is more concentrated than that of Feature Finder Metabo. Applying DeepPIC to five datasets from different samples and instruments demonstrates the method's good generalisation capability. In combination with KPIC2, the method provides the functionality required for the entire process from raw data to discriminant analysis. Finally, the advantages and disadvantages of XCMS, KPIC2 and DeepPIC were compared from several perspectives on the Arabidopsis thaliana dataset, and the results show that DeepPIC has significant advantages over XCMS and KPIC2.
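To make the recall, precision and F1-score comparison concrete, the following is a minimal sketch of how detected features might be scored against a known list of true features. It is not the evaluation code of the thesis: the `Feature` class, the greedy matching rule, and the tolerances `mz_tol` and `rt_tol` are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    mz: float  # mass-to-charge ratio
    rt: float  # retention time in seconds


def count_matches(detected, truth, mz_tol=0.01, rt_tol=5.0):
    """Count true features that pair with a distinct detected feature.

    Assumed rule: a detected feature matches a true feature when both the
    m/z and retention-time differences fall within the given tolerances.
    Each detected feature may be used at most once (greedy matching).
    """
    unused = list(detected)
    matched = 0
    for t in truth:
        candidates = [
            d for d in unused
            if abs(d.mz - t.mz) <= mz_tol and abs(d.rt - t.rt) <= rt_tol
        ]
        if candidates:
            # Prefer the candidate closest in m/z to the true feature.
            best = min(candidates, key=lambda d: abs(d.mz - t.mz))
            unused.remove(best)
            matched += 1
    return matched


def precision_recall_f1(detected, truth, mz_tol=0.01, rt_tol=5.0):
    """Return (precision, recall, F1) for a detected feature list."""
    tp = count_matches(detected, truth, mz_tol, rt_tol)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the MM48 dataset.
    truth = [Feature(100.0, 60.0), Feature(200.0, 120.0), Feature(300.0, 180.0)]
    detected = [
        Feature(100.005, 61.0),
        Feature(200.002, 119.0),
        Feature(400.0, 50.0),
        Feature(500.0, 70.0),
    ]
    p, r, f1 = precision_recall_f1(detected, truth)
    print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

In this toy case two of the three true features are recovered and two of the four detections are spurious, so a method that reports many false features is penalised in precision even when its recall is high.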
Keywords/Search Tags: LC-MS, XCMS, pure ion chromatography, OPLS-DA, XGBoost, KPIC2, U-Net network, metabolomics