Metabolomics measures dynamic changes in low-molecular-weight metabolites in biofluids or tissues. Its basic strategy is to generate large datasets on high-throughput analytical platforms and then analyse them with chemometrics. Such datasets are typically noisy and high-dimensional, and they tend to contain outliers. Chemometrics methods are therefore indispensable for maximizing information retrieval from these datasets and for achieving the two central aims of metabolomics data analysis: identifying the metabolic differences among groups (i.e., pattern recognition) and screening out potentially significant metabolites (i.e., variable selection). Technical improvements in the analytical platforms generate data structures of increasing size and complexity, which poses a great challenge to existing chemometrics methods. Consequently, new chemometrics methods for metabolomics data analysis are in high demand. In metabolomics data analysis, a single classifier with reasonable recognition capacity is usually built to indicate the informative variables, which easily leads to some uncertainty. In this thesis, three new stability-based chemometrics methods were designed for metabolomics data analysis. They draw on (i) the ability of classification tree (CT) and random forest (RF) to perform variable selection and assess variable importance automatically, (ii) the potential of ensemble algorithms based on model diversity to enhance the reliability and stability of a single classifier and to reduce the running costs of algorithms, and (iii) the superior modeling performance of extreme learning machine (ELM), stacked autoencoders (SAEs) and hierarchical extreme learning machine (HELM). The newly designed algorithms, combined with GC-MS, were applied to the screening of inborn errors of metabolism (IEMs). The specific work is as follows:

(1) In this chapter, considering that
a classification tree (CT) can automatically choose the informative variables, whereas ELM shows satisfactory prediction ability but cannot carry out variable selection, we developed a basic classifier, CTELM, by combining CT with ELM. In CTELM, a proper classification tree is built first; an ELM is then built using the splitting variables of the CT as its inputs and the number of nodes in the CT as the size of its hidden layer. Moreover, because selective ensemble algorithms can significantly improve the robustness and reliability of a single model, particle swarm optimization (PSO) and boosting were combined with CTELM to form a new robust chemometrics method, i.e., particle swarm optimization boosting classification tree extreme learning machine (PSO-BST-CTELM). In PSO-BST-CTELM, a series of CTELM models is first built on various weighted versions of the original training data, following the idea of boosting (BST-CTELM). The sub-models with high accuracy and large diversity are then selected via PSO to form the final integrated system. The proposed PSO-BST-CTELM, compared with BST-CTELM, CTELM and ELM, was applied to the GC-MS urinary metabolomic analysis of two of the most common IEMs, i.e., methylmalonic acidemia (MMA) and propionic acidemia (PA). The results revealed that introducing CT improves the interpretability of ELM, and that PSO-BST-CTELM further improves the generalization ability and stability of a single CTELM model. In addition, combined with one-way ANOVA and fold change, PSO-BST-CTELM identified three informative metabolites associated with MMA: methylmalonic-2, 3-OH-propionic-2 and methylcitric-4. For PA, the potential metabolites identified were 3-OH-propionic-2, methylcitric-4 and tiglylglycine-1.

(2) Although SAEs are a well-performing machine learning algorithm, traditional SAEs are not applicable to metabolomics studies because of their difficulty in identifying contributing factors. To overcome this issue, we used bagging
classification tree (BAGCT) in combination with SAEs to form a new chemometrics method called BAGCT-SAEs for metabolomics data analysis, taking advantage of the good reliability and robustness of BAGCT in variable selection. In BAGCT, a set of parallel CT models is established by bagging. Each CT provides information such as its splitting variables and their corresponding contribution values. The most discriminative variables can be easily discovered by inspecting the variable contribution values over all CTs in BAGCT. The variables with importance values larger than zero are used as the inputs of SAEs. The proposed BAGCT-SAEs, compared with SAEs, radial basis function network (RBFN), support vector machine (SVM) and partial least squares discriminant analysis (PLSDA), was applied to the GC-MS-based urinary metabolomic analysis of two of the most common IEMs, i.e., glutaric acidemia type 1 (GA1) and propionic acidemia (PA). The results revealed that BAGCT-SAEs compared favorably with the other algorithms involved. In addition, combined with one-way ANOVA and fold change, BAGCT-SAEs identified two informative metabolites associated with GA1: glutaric-2 and 2-OH-glutaric-3. For PA, the potential metabolites identified were 3-OH-propionic-2, methylcitric-4, 2-OH-butyric-2 and 2-methyl-3-OH-butyric-1-2.

(3) Although hierarchical extreme learning machine (HELM) shows promising recognition ability, it has difficulty identifying the most discriminative variables. To make it suitable for metabolomics data analysis, a filter-type variable selection method for HELM modeling is needed. In this chapter, in view of the good reliability and robustness of random forest (RF) in variable selection, we combined RF with HELM to form a new chemometrics method, i.e., RF-HELM. In RF-HELM, RF is used to build a set of parallel decision tree classifiers based on random resampling of both the samples and the variables. Once the RF has been constructed, one can obtain the most discriminative variables
which serve as the inputs of HELM. In this chapter, datasets related to two IEMs, i.e., methylmalonic acidemia (MMA) and propionic acidemia (PA), were used to test the performance of the newly proposed RF-HELM in comparison with ELM, RBFN, SVM and PLSDA. The results revealed that introducing RF effectively improves the interpretability of HELM, and that RF-HELM shows recognition ability superior to that of ELM, RBFN and PLSDA. In addition, combined with one-way ANOVA and fold change, RF-HELM identified three informative metabolites associated with MMA: methylmalonic-2, 3-OH-propionic-2 and methylcitric-4. For PA, the potential metabolites identified were 3-OH-propionic-2, methylcitric-4 and tiglylglycine-1.
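
The tree-ensemble variable selection described for BAGCT in part (2), which RF extends by also resampling the variables, can be illustrated with a minimal sketch. It is not the thesis implementation: each full classification tree is replaced by a single-split stump to keep the example short, Gini impurity decrease is assumed as the contribution value, and the function names and toy data are hypothetical.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_stump(X, y):
    """Return (variable index, impurity decrease) of the best single split,
    or (None, 0.0) if no split reduces the impurity."""
    n, parent = len(y), gini(y)
    best_var, best_gain = None, 0.0
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0  # candidate threshold between neighbours
            left = [y[i] for i in range(n) if X[i][j] <= thr]
            right = [y[i] for i in range(n) if X[i][j] > thr]
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > best_gain:  # strict comparison: earlier variables win ties
                best_var, best_gain = j, gain
    return best_var, best_gain

def bagged_importance(X, y, n_trees=25, seed=0):
    """Average impurity decrease per variable over stumps fitted on
    bootstrap resamples of the training data (the bagging idea)."""
    rng = random.Random(seed)
    n = len(y)
    importance = [0.0] * len(X[0])
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        var, gain = best_stump([X[i] for i in idx], [y[i] for i in idx])
        if var is not None:
            importance[var] += gain
    return [v / n_trees for v in importance]

# Hypothetical toy data: variable 0 separates the two classes,
# variables 1 and 2 are pure noise.
toy_rng = random.Random(1)
X, y = [], []
for i in range(20):
    label = i % 2
    X.append([label * 5.0 + toy_rng.random(), toy_rng.random(), toy_rng.random()])
    y.append(label)

importance = bagged_importance(X, y)
# Keep the variables whose importance is larger than zero, as in BAGCT-SAEs.
selected = [j for j, v in enumerate(importance) if v > 0]
print(selected)
```

On this toy data only variable 0 receives a nonzero importance, so it alone would be passed on as input to the downstream model. With full trees instead of stumps, noise variables can pick up small nonzero importances in deeper splits, so a threshold of zero is a permissive filter.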