Multi-class Protein Folds Recognition Based On Random Forest

Posted on: 2015-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Feng
Full Text: PDF
GTID: 2180330467466078
Subject: Computational Mathematics
Abstract/Summary:
With the completion of the Human Genome Project, the "post-genome era" has produced large numbers of protein sequences, and annotating their structural information requires high-throughput computational methods. A protein can perform its physiological functions only if it folds into its proper structure, and abnormal folding may cause disease. For example, the pathogenic prion protein (PRNP), which arises from abnormal protein folding, accumulates in the brain and results in neurodegenerative diseases including Alzheimer's disease, spongiform encephalopathy, Parkinson's disease, and mad cow disease. The correct identification of protein folds is therefore valuable for studies of pathogenic mechanisms and for drug design, and it is an important research topic in bioinformatics. Since Ding and Dubchak recognized 27-class protein folds in 2001, the algorithms, prediction parameters, and datasets used for protein fold prediction have all been improved. Based on this previous research, our major contributions are as follows:

(1) Based on the 76-class folds dataset built by Liu et al. in our group, we reorganized the dataset, adding another 8 and 5 protein sequences to the training set and testing set, respectively. The sequence identity within the dataset is below 35%, and each fold type contains no fewer than 10 sequences. The training set and testing set contain 1744 and 1727 protein chains, respectively. The first 27 fold types are consistent with Ding and Dubchak's dataset, and each fold type has been expanded.

(2) Considering correlation at the level of secondary structure segments, we proposed the interaction information, which reflects the segment order and long-range correlation of the sequence. This information has a major influence on protein folding and has not been considered by previous researchers. Because chemical shifts reflect structural information, the nature of hydrogen exchange dynamics, ionization and oxidation states, the ring-current influence of aromatic residues, and hydrogen bonding interactions, we calculated, for the first time, the averaged chemical shifts (ACS) of secondary structure segments as a feature parameter.

(3) We identified the 27-class protein folds dataset. Based on the 27-class folds dataset built by Liu et al. in our group, we extracted amino acid composition, motif frequency, and predicted secondary structure information, and calculated the interaction of predicted secondary structure segments. Using an ensemble classification strategy, with the combined feature vector as the input to the random forest algorithm, we identified the 27-class protein folds and the corresponding structural classes by the jackknife test. The overall accuracies for the testing set and for structural classification reached 78.38% and 92.55%, respectively, which is better than previously reported results.

(4) We identified the 76-class protein folds dataset. Using the reorganized 76-class folds dataset built by Liu et al., we extracted the increment of diversity, motif frequency, predicted secondary structure information, and the averaged chemical shifts of secondary structure segments. With the combined feature vector as the input to the random forest algorithm and an ensemble classification strategy, we proposed a novel method for identifying 76-class protein folds. In the independent test on the testing set, the overall accuracy was 66.69%; when the training set and testing set were combined in 5-fold cross-validation, the overall accuracy was 73.43%. The method was then applied to the 27-class protein folds dataset: in independent tests, the overall accuracies for the testing set and for the corresponding structural classification reached 79.66% and 93.40%, respectively. Moreover, when the training set and testing set were combined, the accuracy in 5-fold cross-validation was 81.21%. In addition, this approach produced better results when tested on the 27-class protein folds dataset built by Ding and Dubchak.
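
As a rough illustration of two ingredients named in the abstract, amino acid composition as a feature and overall accuracy under 5-fold cross-validation, the following Python sketch may help. It is not the thesis implementation. All function names are hypothetical, the fold split is a plain shuffled split (the thesis does not state how folds were drawn), and a simple nearest-centroid classifier stands in for the random forest so that the example needs only the standard library.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the 20-dimensional residue frequency vector of a sequence."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def cross_validate(features, labels, fit, predict, k=5, seed=0):
    """Pooled overall accuracy over k folds for any fit/predict pair."""
    truth, guesses = [], []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            truth.append(labels[i])
            guesses.append(predict(model, features[i]))
    return overall_accuracy(truth, guesses)


# Stand-in classifier (an assumption for this sketch, NOT a random forest):
# each class is represented by the mean of its training vectors.
def fit_centroids(X, y):
    counts = Counter(y)
    sums = {}
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_centroid(model, vec):
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


if __name__ == "__main__":
    # Toy sequences and labels, invented purely for illustration.
    seqs = ["AAAAL", "AAALL", "ALALA", "LLAAA", "AALLA",
            "VVVVI", "VIVIV", "IIVVV", "VVIIV", "IVVVI"]
    labels = ["alpha"] * 5 + ["beta"] * 5
    X = [amino_acid_composition(s) for s in seqs]
    print(cross_validate(X, labels, fit_centroids, predict_centroid, k=5))
```

Because `cross_validate` takes the `fit` and `predict` functions as arguments, a random forest from any library could be substituted for the stand-in classifier without changing the evaluation loop.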
Keywords/Search Tags: Protein Fold, Interaction of segments, Predicted secondary structure, Motifs, Averaged chemical shifts