
Multi-class Machine Learning And Its Application Of Protein Structural Class Prediction

Posted on: 2015-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: B Zheng
GTID: 2250330428464473
Subject: Control Engineering
Abstract/Summary:
With the completion of the Human Genome Project and the development of biological sequencing technology, vast amounts of protein sequence information are being produced. However, a large gap has emerged between the sharply increasing number of known protein sequences and the slow accumulation of known higher-order protein structures and functions. Traditional biological experiments can no longer meet the demand, so it is meaningful to find a reliable and effective computational approach to predicting protein structures and functions, which is a challenging task facing bioinformatics researchers.

Proteins are the main embodiment and carriers of life activity. Revealing the inner mysteries of life by studying the relationship between protein structure and function is at the core of bioinformatics research in the new century. Protein structural class is key to the study of protein structure and function, so this paper centers on the prediction of protein structural classes. The problem is approached from three aspects: extraction and combination of protein sequence feature information, selection among multiple protein sequence features, and prediction of protein structural classes based on machine learning. The main work and innovations of this paper, all aimed at further improving the prediction accuracy of protein structural classes, are summarized as follows:

1) Extraction and combination of protein sequence feature information

The quality of the extracted feature information directly affects the accuracy of protein structural class prediction. To describe a given protein sequence more comprehensively, this paper proposes a set of features intended to reflect the protein sequence as fully as possible.
It comprises two feature extraction methods, k-word statistical frequency and k-fragment distribution, which respectively extract frequency and position information from the primary sequence of the protein, its physicochemical property sequence, and its secondary structure sequence. Combining features of different natures overcomes the shortcomings of any single type of feature information and lays a solid foundation for improving the prediction accuracy of protein structural classes.

2) Selection of protein sequence multiple feature information

Although combining more feature information of different natures describes protein sequences more comprehensively, in practice a higher feature dimension does not necessarily yield higher classification accuracy. Instead, the noise and redundancy in high-dimensional feature information increase the computational cost and complexity of the classification model, which harms both the classification accuracy and the generalization ability of the classifier. Therefore, this paper applies a feature selection algorithm based on the genetic algorithm to the selection of protein sequence feature information. The main idea of the genetic algorithm is "survival of the fittest": as the number of iterations increases, individuals with poor fitness are gradually discarded, while those with good fitness are retained and continue to breed. The features selected by the genetic algorithm not only retain most of the information in the original feature set but also reduce its dimensionality.
That helps improve the performance of the classification model.

3) Prediction of protein structural classes based on machine learning

In protein structural class prediction, the machine learning algorithm is a very important part and directly determines whether the prediction succeeds or fails. This study first introduces three common single classification algorithms: artificial neural networks, the Bayesian algorithm, and the support vector machine. Because traditional classifiers each have flaws and none discriminates well on all sample characteristics, it then describes four common multi-classifier fusion algorithms: majority voting, the Bayesian rule, the mean method, and the weighted mean method. In these common fusion algorithms, the decisions of the individual classifiers are made independently, without full consultation among the classifiers, which loses some decision-making information. For this reason, this paper proposes a new multi-classifier fusion algorithm named the Ma_Ada algorithm. The experimental results show that the Ma_Ada multi-classifier fusion algorithm improves the prediction accuracy of protein structural classes to a greater degree.

In summary, from the perspective of bioinformatics, this study systematically addresses the extraction and combination of multiple kinds of protein sequence feature information, the selection of protein sequence features, and the prediction of protein structural classes. These results will help promote further study of protein structure and function and, at the same time, benefit the development of protein sequence analysis and machine learning.
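
Three of the common fusion rules named in the abstract (majority voting, the mean method, and the weighted mean method) can be illustrated with a minimal Python sketch. This is not the Ma_Ada algorithm, whose details the abstract does not give. The class labels, the probability values, and the weights below are hypothetical and chosen only to show that the rules can disagree; the function names are likewise illustrative.

```python
from collections import Counter

# Assumption: the four conventional structural classes; the abstract does not list them.
CLASSES = ["all-alpha", "all-beta", "alpha/beta", "alpha+beta"]


def argmax(values):
    """Index of the largest value (first index wins ties)."""
    return max(range(len(values)), key=values.__getitem__)


def majority_vote(labels):
    """Label chosen by the most classifiers (first seen wins ties)."""
    return Counter(labels).most_common(1)[0][0]


def mean_rule(prob_rows):
    """Average the class probabilities over classifiers; return the best class index."""
    n = len(prob_rows)
    return argmax([sum(col) / n for col in zip(*prob_rows)])


def weighted_mean_rule(prob_rows, weights):
    """Like mean_rule, but each classifier's probabilities are scaled by its weight."""
    total = sum(weights)
    scores = [
        sum(w * p for w, p in zip(weights, col)) / total
        for col in zip(*prob_rows)
    ]
    return argmax(scores)


# Hypothetical class-probability outputs of three single classifiers for one protein.
probs = [
    [0.50, 0.30, 0.10, 0.10],  # e.g. artificial neural network
    [0.20, 0.60, 0.10, 0.10],  # e.g. Bayesian classifier
    [0.45, 0.35, 0.10, 0.10],  # e.g. support vector machine
]
votes = [CLASSES[argmax(row)] for row in probs]

print("majority vote:", majority_vote(votes))
print("mean rule:", CLASSES[mean_rule(probs)])
print("weighted mean:", CLASSES[weighted_mean_rule(probs, [0.45, 0.10, 0.45])])
```

In this toy example the majority vote and the mean rule pick different classes, because one confident classifier outweighs two lukewarm ones when probabilities are averaged. Each rule combines the classifiers' outputs only after they have decided independently, which is the limitation the abstract attributes to the common fusion algorithms.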
Keywords/Search Tags: protein structural class prediction, feature extraction, genetic algorithm, machine learning, multi-classifier fusion algorithm, Ma_Ada fusion algorithm
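
As an illustration of the feature extraction step, the k-word statistical frequency mentioned in the abstract can be sketched in Python as follows. The abstract does not define the feature precisely, so this sketch assumes it means the normalized count of every overlapping k-letter word over a fixed alphabet. The function name, the alphabets, and the example sequences are hypothetical.

```python
from itertools import product

# Assumption: the 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_word_frequencies(sequence, k=2, alphabet=AMINO_ACIDS):
    """Normalized frequency of every overlapping k-word over `alphabet`.

    The result has len(alphabet) ** k entries, ordered like
    itertools.product(alphabet, repeat=k).
    """
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if word in counts:  # skip words containing letters outside the alphabet
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in words]


# Hypothetical primary sequence: 1-word frequencies give a 20-dimensional vector.
composition = k_word_frequencies("MKVLAAGIVK", k=1)

# Hypothetical secondary structure string over helix/strand/coil (H, E, C):
# 2-word frequencies give a 3 ** 2 = 9-dimensional vector.
ss_features = k_word_frequencies("HHHEECCHH", k=2, alphabet="HEC")

print(len(composition), len(ss_features))
```

The same function applies to any symbol sequence, so vectors computed from the primary sequence, a physicochemical property sequence, and the secondary structure sequence can be concatenated into one combined feature vector. The dimension grows as the alphabet size raised to the power k, which is one reason a feature selection step is useful.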