
Ensemble-Based Robust Chemometrics For Analyzing Metabonomics Dataset

Posted on: 2018-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: S F Chen
GTID: 2381330518975839
Subject: Analytical Chemistry
Abstract/Summary:
Metabonomics, an emerging field, offers insight into physiological processes. It aims to investigate global metabolic variation in biological systems by monitoring the levels of small-molecule metabolites in biofluids and biological tissues. In common practice, metabolic profiles associated with two groups (e.g., diseased versus control) are first acquired on high-throughput analytical platforms such as NMR and then analyzed by chemometric methods such as partial least-squares discriminant analysis (PLS-DA). The high complexity of metabolic profiles poses tremendous challenges to existing chemometric methods. Typically, the informative variables are identified from a single classifier, which is often unreliable in practice. How to guarantee the robustness and reliability of the results of metabonomics data analysis has therefore received increasing attention. In this thesis, we considered the requirements of data analysis in metabonomics, the potential of ensemble methods such as bagging and boosting to improve the reliability and robustness of a single model, the ability of the classification tree (CT) to carry out variable selection automatically and to measure variable importance, and the promising modeling performance of the support vector machine (SVM). On this basis, we designed a series of new approaches for metabonomics data analysis, as follows:

(1) Boosting partial least-squares discriminant analysis (BPLSDA) was used for 1H NMR analysis of lung cancer metabolism based on serum samples. BPLSDA first constructs a series of PLS-DA models on variously weighted versions of the original training set. It then combines the recognition results of these models by weighted majority vote, and their variable importance values by taking the maximum absolute value, to obtain the integrated classification results and variable importance values. For informative variable identification, three criteria were considered: variable importance in projection (VIP), regression coefficients, and weight coefficients. Conventional PLS-DA was also investigated for comparison. Experimental results showed that the inter-variety difference can be accurately and rapidly distinguished by BPLSDA and 1H NMR. Moreover, the introduction of boosting drastically enhances the performance of an individual PLS-DA model, and BPLSDA can identify the most informative variables reliably and robustly.

(2) To reduce the uncertainty of variable selection in traditional variable selection methods, bagging and CT were combined into a general framework (BAGCT) for robustly selecting informative variables. The framework builds on the ability of CT to carry out variable selection automatically and to measure variable importance, and on the ability of bagging to improve the reliability and robustness of a single model. In BAGCT, a set of parallel CT models is built following the idea of bagging, each CT providing information such as its splitting variables and their corresponding importance values. The informative variables can be identified by inspecting the variable importance values over all CTs in BAGCT. Given the promising properties of SVM, we used the informative variables identified by BAGCT as the inputs of SVM, forming a new classification tool abbreviated as BAGCT-SVM. A 1H NMR metabonomics data set from patients with lung cancer and healthy controls was used to validate BAGCT-SVM, with CT and SVM as comparisons. The results showed that BAGCT-SVM, using fewer variables, gives better predictive ability than CT and SVM.

(3) Considering the ability of boosting to improve the reliability and robustness of a single model, we designed another robust variable selection strategy by combining boosting and CT, forming boosting CT (BSTCT). The informative variables identified by BSTCT were used as inputs of SVM (BSTCT-SVM), adding a new approach to the arsenal of SVM methods. BSTCT-SVM was also employed to analyze the 1H NMR-based metabonomics data set from lung cancer, in comparison with CT and SVM. We demonstrated experimentally that BSTCT-SVM is able to automatically eliminate variable redundancy and yields better classification performance than traditional CT and SVM.
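
The combination step described for BPLSDA in item (1) can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes each base model has already produced class labels and one importance value per variable, and it reads the abstract's wording on variable importance as "keep, for each variable, the largest absolute value over all models". The function names, the voting weights, and all numbers below are hypothetical; the abstract does not say how the voting weights are derived.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, model_weights):
    """Combine class labels from several models, sample by sample.

    predictions[m][i] is the label model m assigns to sample i;
    model_weights[m] is that model's voting weight (assumed given).
    """
    n_samples = len(predictions[0])
    combined = []
    for i in range(n_samples):
        votes = defaultdict(float)
        for model_preds, weight in zip(predictions, model_weights):
            votes[model_preds[i]] += weight
        # Keep the label that collected the largest total weight.
        combined.append(max(votes, key=votes.get))
    return combined


def max_abs_importance(importances):
    """For each variable, keep the largest absolute importance over all models."""
    return [max(abs(v) for v in per_variable) for per_variable in zip(*importances)]


# Hypothetical labels (+1 / -1) from three base models on four samples.
predictions = [
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [-1, -1, -1, 1],
]
model_weights = [1.2, 0.8, 0.9]  # made-up voting weights

# Hypothetical importance values (e.g. regression coefficients) for three variables.
importances = [
    [0.2, -1.5, 0.1],
    [-0.9, 0.4, 0.0],
    [0.3, 1.1, -0.2],
]

labels = weighted_majority_vote(predictions, model_weights)
scores = max_abs_importance(importances)
# Variable indices ordered from most to least important.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

With these made-up inputs, the second variable ranks first because one model assigns it an importance of -1.5. Taking the absolute value lets strongly negative coefficients count as much as strongly positive ones.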
Keywords/Search Tags:Metabonomics, Chemometrics, Boosting Partial Least-Squares Discriminant Analysis, Bagging Classification Tree-Support Vector Machine, Boosting Classification Tree-Support Vector Machine, Lung Cancer