
Research On Prediction Of Biological Function Of Small Molecules In Metabolic Pathway Using Data Mining

Posted on: 2013-04-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C R Peng
GTID: 1220330395453612
Subject: Materials science
Abstract/Summary:
Metabolism is the set of chemical reactions that occur in the cells of living organisms to sustain life. These processes allow organisms to grow and reproduce, maintain their structures, and respond to their environments. Small molecules are involved throughout metabolic reactions. Small molecules are natural compounds with relatively low molecular weight, usually less than 1000 daltons (especially less than 400 daltons). More than one hundred thousand small molecules can participate in biological processes, including metabolic reactions, but fewer than 1% have a known biological function so far. Research on recognizing and predicting the biological functions of small molecules therefore helps in understanding the biological and chemical nature of some questions in the processes of life. However, mature methods and technologies are not yet available, and most knowledge comes from expert experience. Large-scale experiments consume too much time, manpower, and resources. Fortunately, the biological function of unknown small molecules can be predicted by collecting experimental results and using data mining to summarize the regularities implied in known data, which provides an alternative to expert experience.

To recognize and predict the biological functions of small molecules using data mining, the first problem is how to encode small molecules, which plays a crucial role in mathematical modeling. After comparing existing commercial and open-source programs for computing molecular descriptors, ‘Calculator Plugins’ of ChemAxon was selected, and a program for calculating molecular descriptors was developed. This program is a secondary development based on ‘Calculator Plugins’, written in Java; it is easy to use and can be customized for batch calculation.
This program greatly improves the convenience and efficiency of the calculation, providing an efficient tool for the research described above.

Mapping small molecules to their corresponding metabolic pathways correctly and efficiently contributes to the analysis of metabolic pathways and to a deeper understanding of metabolic mechanisms. ‘JChem for Excel’ of ChemAxon was chosen for batch computation of small-molecule descriptors, the mRMR (minimum Redundancy Maximum Relevance) and FFS (Feature Forward Search) algorithms were selected for feature selection, and the AdaBoost algorithm based on the C4.5 decision tree algorithm was used to predict the metabolic pathways in which small molecules may be involved. The prediction accuracies of the 10-fold cross-validation test and the independent set test for the metabolic pathway are 83.88% and 85.23%, respectively. These results are a significant improvement over predictions based on encoding by functional group composition. The possible subpathway within lipid metabolism in which small molecules are involved was also predicted. ‘HyperChem’ was chosen for computing small-molecule descriptors, the CFS (Correlation-based Feature Subset) algorithm was selected for feature selection, and the Bagging algorithm based on the nearest neighbor algorithm was used for modeling. The prediction accuracies of jackknife cross-validation and the independent set are 89.85% and 91.46%, respectively.

Small molecules participate in the whole metabolic process in metabolic pathways through interaction with enzymes. Predicting unknown molecule-enzyme interactions from known ones can provide new ideas for exploring various metabolic or catalytic mechanisms. The output of the program developed above was used to encode small molecules, improved pseudo amino acid composition was used to encode enzymes, and three algorithms were chosen for feature selection: mRMR, IFS (Incremental Feature Selection), and FFS.
The prediction model for molecule-enzyme interaction in metabolic pathways was built using the nearest neighbor algorithm. The prediction accuracies of the 10-fold cross-validation test and the independent set test for molecule-enzyme interaction are 85.19% and 85.32%, respectively, and the prediction accuracies for positive samples in the two tests are 86.02% and 86.74%, respectively. The prediction accuracy for positive samples increased greatly compared with previous work.

Protein-RNA interaction was studied using a voting algorithm, which is conducive to understanding the gene expression of proteins. Thirty-four classifiers were chosen from ‘Weka’, and four voting systems were built. As a result, the voting systems perform better than any single classifier, and algorithm selection and weighting can optimize the prediction accuracy. The weighted voting system with algorithm selection achieved the best prediction results: the average ACC (overall prediction accuracy) and average MCC (Matthews Correlation Coefficient) reached 82.04% and 64.70%, respectively, on the independent dataset.
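
The abstract does not say how the voting weights were assigned or how the classifiers were combined in Weka. As an illustration only, weighted voting over binary classifier outputs and the two reported metrics (ACC and MCC) might be sketched in Python as follows. The function names, the 0/1 label encoding, the tie-breaking rule, and the toy weights are all assumptions made for this sketch, not details from the dissertation.

```python
from math import sqrt


def weighted_vote(predictions, weights):
    """Combine 0/1 predictions from several classifiers for one sample.

    Each classifier adds its weight to the class it predicts; the class
    with the larger total wins. Ties go to the negative class (an
    arbitrary choice made for this sketch).
    """
    score_pos = sum(w for p, w in zip(predictions, weights) if p == 1)
    score_neg = sum(w for p, w in zip(predictions, weights) if p == 0)
    return 1 if score_pos > score_neg else 0


def accuracy(y_true, y_pred):
    """ACC: fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy data: three hypothetical classifiers, four samples.
    weights = [0.9, 0.3, 0.3]  # e.g. validation accuracies (assumed)
    per_sample_votes = [[1, 0, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
    y_true = [1, 0, 1, 0]
    y_pred = [weighted_vote(v, weights) for v in per_sample_votes]
    print("ACC:", accuracy(y_true, y_pred))
    print("MCC:", mcc(y_true, y_pred))
```

With weights like these, a single strong classifier can outvote two weaker ones, which is the behavior that distinguishes weighted voting from simple majority voting.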
Keywords/Search Tags: data mining, small molecules, molecular descriptor, metabolic pathway, ChemAxon, voting algorithm