
Research On Topics Of Bioinformatics Employing Ensemble Learning Algorithm

Posted on: 2010-10-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Niu
GTID: 1100360278976320
Subject: Materials science
Abstract/Summary:
In the late 20th century, with the rapid development of bioscience techniques, human genomics, and the genomics of other organisms, the volume of biological information grew at a surprising speed, which greatly enriched bioinformation resources and led to the birth of bioinformatics. In bioinformatics, researchers try to discover comprehensive biological knowledge by capturing, managing, storing, retrieving, and analyzing biological information. Data mining technology is used to extract potentially useful information from databases and is playing an increasingly important role in bioinformatics research. In this dissertation, ensemble learning methods were used to investigate several topics in bioinformatics. The main work consists of the following four parts:
1. Using ensemble learning algorithms to study the prediction of protein structure and function types. With the success of the Human Genome Project, the number of protein sequences entering the data banks is increasing rapidly. The structures and functions of these proteins can be determined experimentally, but doing so is very time-consuming and nearly impossible in practice. Scientists have therefore been seeking theoretical or computational methods for predicting the structures and functions of proteins. In this dissertation, AdaBoost and Bagging were employed to classify or predict protein structures and functional locations based on the amino acid composition of the sequence. During modeling, four different weak machine learning methods were used to build the models, and the modeling parameters were optimized based on the cross-validation results of the models.
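The amino acid composition representation mentioned above can be illustrated with a minimal sketch. It assumes the common 20-dimensional definition (the fraction of each standard residue in the sequence); the dissertation's exact encoding is not given here, and the function name `amino_acid_composition` is hypothetical.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each standard residue.

    Non-standard characters (for example 'X') are ignored. This is an
    illustrative encoding, not necessarily the one used in the dissertation.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # Counter returns 0 for residues that never occur in the sequence.
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

A vector of this kind, one per protein, is what a classifier such as AdaBoost or Bagging would take as input.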
The results show that: (1) the best models, with prediction accuracies of 94.18% and 85.90%, were obtained using AdaBoost-RandomForest in leave-one-out cross-validation on two standard protein structure data sets, respectively; (2) the best models, with prediction accuracies of 91.80% and 80.80%, were obtained using AdaBoost-C4.5 in leave-one-out cross-validation for the subcellular location of prokaryotic and eukaryotic proteins, respectively; (3) the best model, with a correct rate of 84.42%, was obtained using Bagging-KNN in leave-one-out cross-validation for membrane proteins. All prediction accuracies obtained with the ensemble learning methods are better than previously reported results. Based on the models for predicting subcellular location and membrane proteins, two corresponding online web servers were established.
2. Using ensemble learning algorithms to study the prediction of the metabolic pathways of small molecules and of small molecule-enzyme interactions. First, based on the AdaBoost method and using functional group composition as features, a novel approach is proposed to quickly map small chemical molecules back to the metabolic pathways they may belong to. The model reached 74.05% and 75.11% in the 10-fold cross-validation test and the independent set test, respectively. Second, building on this work, amino acid physicochemical properties were used to encode the enzymes, giving 160 features in total. These features were input into an AdaBoost classifier to predict whether a small molecule and an enzyme interact. The overall prediction accuracies, tested by 10-fold cross-validation and on the independent set, were 81.76% and 83.35%, respectively. Based on the models for predicting small-molecule metabolic pathways and small molecule-enzyme interactions, two corresponding online web servers were built.
3. An AdaBoost learner is employed to investigate the toxic action mechanisms of phenols based on molecular descriptors.
A total of 274 phenols were collected from different references, and 45 descriptors were calculated. First, 9 descriptors were selected using the CFS (Correlation-based Feature Subset) method. Then C4.5, RandomTree, RandomForest, and K-nearest neighbors (KNN) were each employed as the base classifier of AdaBoost to build models, and C4.5 was selected. Finally, the performance of the AdaBoost learner was compared with that of support vector machine (SVM) and KNN, which are the most common algorithms used for structure-activity relationship (SAR) analysis. The AdaBoost learner performed better than SVM and KNN in predicting the toxicity mechanisms of phenols based on molecular descriptors. It can be concluded that AdaBoost has the potential to improve the performance of SAR analysis. An online web server was also developed for predicting the ecotoxicity mechanisms of phenols.
4. Knowledge of the polyprotein cleavage sites of HIV protease will refine our understanding of its specificity, and the information thus acquired is useful for designing specific and efficient HIV protease inhibitors. Recently, a number of classifier creation and combination methods have been proposed to approach the HIV-1 protease specificity problem. The search for proper inhibitors of HIV protease would be greatly expedited by an accurate, robust, and rapid method for predicting the cleavage sites in proteins by HIV protease. In this work, HIV-1 protease was selected as the subject of study. Two hundred ninety-nine oligopeptides were chosen as the training set, and the other sixty-three oligopeptides were taken as the test set. The peptides are represented by features constructed from AAindex. The mRMR (Maximum Relevance, Minimum Redundancy) method, combined with Incremental Feature Selection (IFS) and Feature Forward Search (FFS), was applied to find the 2 important cleavage sites and to select 364 important biochemical features by the jackknife test.
Using KNN (K-nearest neighbors) with the selected features, a prediction model was obtained with accuracy rates of 91.3% and 87.3% in the jackknife cross-validation test and the independent-set test, respectively. It is expected that this feature selection scheme can serve as a useful assistant technique for finding effective inhibitors of HIV protease.
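As a rough illustration of the jackknife (leave-one-out) evaluation of a KNN classifier described above, the following sketch predicts each sample from all the others and reports the fraction predicted correctly. It assumes Euclidean distance and a simple majority vote; the function names, the toy feature vectors, and the labels "cleaved" and "uncleaved" are hypothetical and are not data from the dissertation.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points nearest to the query.

    Distance is Euclidean (an assumption). On a tied vote, the label seen
    first among the nearest neighbours wins.
    """
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def jackknife_accuracy(xs, ys, k=3):
    """Leave-one-out test: hold out each sample once, train on the rest."""
    correct = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        if knn_predict(rest_x, rest_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical two-feature toy data, for illustration only.
toy_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y = ["cleaved"] * 3 + ["uncleaved"] * 3
toy_accuracy = jackknife_accuracy(toy_x, toy_y, k=3)
```

In a setting like the one described, each row of `xs` would hold the selected features of one oligopeptide, and the independent-set test would instead call `knn_predict` on samples kept out of training entirely.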
Keywords/Search Tags:bioinformatics, ensemble learning, AdaBoost, Bagging, protein structure, subcellular location, membrane protein, metabolic pathway, small molecule, amino acid composition, functional group composition, HIV-1 protease, cross-validation test