Multi-class Machine Learning And Its Application Of Protein Structural Class Prediction

Posted on:2015-01-03

Degree:Master

Type:Thesis

Country:China

Candidate:B Zheng

Full Text:PDF

GTID:2250330428464473

Subject:Control Engineering

Abstract/Summary:

With the completion of the Human Genome Project and the development of biological sequencing technology, vast amounts of protein sequence information are being produced. However, a large gap has emerged between the rapidly growing number of known protein sequences and the slow accumulation of knowledge about higher-order protein structures and functions. Traditional biological experiments can no longer meet the demand, so it is worthwhile to find reliable and effective computational approaches for predicting protein structure and function, which remains a challenging task for bioinformatics researchers.

Proteins are the main embodiment and carriers of life activities. Revealing the inner mysteries of life by studying the relationship between protein structure and function is at the core of bioinformatics research in the new century. Protein structural class is key to the study of protein structure and function, so this thesis centers on the prediction of protein structural classes. The problem is approached from three aspects: extraction and combination of protein sequence feature information, selection among multiple protein sequence features, and prediction of protein structural classes based on machine learning. The main work and innovations, all aimed at further improving the prediction accuracy of protein structural classes, are summarized as follows:

1) Extraction and combination of protein sequence feature information
The quality of the extracted feature information directly affects the accuracy of protein structural class prediction. To describe a given protein sequence more comprehensively, this thesis proposes a set of features intended to reflect the protein sequence fully. It comprises two feature extraction methods, k-word statistical frequency and k-fragment distribution, which extract frequency and position information, respectively, from the primary sequence of the protein, its physicochemical property sequence, and its secondary structure sequence. Features of different natures are combined to overcome the shortcomings of any single feature, laying a solid foundation for improving the prediction accuracy of protein structural classes.

2) Selection of multiple protein sequence features
Although combining more features of different natures describes protein sequences more comprehensively, a higher feature dimension does not necessarily give higher classification accuracy. Instead, the noise and redundancy of high-dimensional features increase the computational cost and the complexity of the classification model, which harms both the classification accuracy and the generalization ability of the classifiers. Therefore, this thesis applies a feature selection algorithm based on the genetic algorithm to protein sequence features. The main idea of the genetic algorithm is "survival of the fittest": as the number of iterations grows, individuals with poor fitness are gradually discarded, while those with good fitness are retained and continue to breed. The features selected by the genetic algorithm retain most of the information in the original feature set while reducing its dimensionality, which helps improve the performance of the classification model.

3) Prediction of protein structural classes based on machine learning
In the prediction of protein structural classes, the machine learning algorithm is a very important component that directly determines whether the prediction succeeds. This study first introduces three common single-classifier algorithms: artificial neural networks, the Bayesian algorithm, and support vector machines. Because traditional classifiers each have flaws and none discriminates well on all sample characteristics, it then describes four common multi-classifier fusion algorithms: majority voting, the Bayesian rule, the mean method, and the weighted mean method. In these common fusion algorithms, the individual classifiers make their decisions independently, without fully consulting one another, so some decision information is lost. For this reason, this thesis proposes a new multi-classifier fusion algorithm named the Ma_Ada algorithm. The experimental results show that the Ma_Ada multi-classifier fusion algorithm further improves the prediction accuracy of protein structural classes.

In summary, from the perspective of bioinformatics, this study systematically addresses the extraction and combination of protein sequence feature information, the selection of multiple protein sequence features, and the prediction of protein structural classes. These results will help promote further study of protein structure and function, and they also benefit the development of protein sequence analysis and machine learning.
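
The abstract does not give the exact definition of the k-word statistical frequency feature. As a rough illustration only, the sketch below assumes it means the normalized counts of overlapping length-k words over the 20 standard amino acids; the function name `k_word_frequencies` and the alphabet ordering are hypothetical choices, not taken from the thesis.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_word_frequencies(sequence, k=2, alphabet=AMINO_ACIDS):
    """Return a fixed-length vector of overlapping k-word frequencies.

    The vector has len(alphabet) ** k entries, one per possible k-word,
    in the order produced by itertools.product.
    """
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Normalize only over words built from the chosen alphabet.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


vector = k_word_frequencies("ACAC", k=2)
print(len(vector))  # 400 possible 2-words
```

The same function could in principle be applied to a reduced alphabet, for example a physicochemical grouping of residues or secondary structure symbols, by passing a different `alphabet`; which encodings the thesis actually uses is not stated in the abstract.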
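
The "survival of the fittest" idea behind genetic-algorithm feature selection can be sketched as follows. This is a generic textbook variant (binary feature masks, tournament selection, one-point crossover, bit-flip mutation, elitism), not the thesis's implementation; `ga_select`, `tournament`, `toy_fitness`, and all parameter values are hypothetical. In practice the fitness of a mask would usually be the validation accuracy of a classifier trained on the selected features.

```python
import random


def tournament(population, fitness, rng):
    """Pick the fitter of two randomly drawn individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def ga_select(n_features, fitness, pop_size=20, generations=30,
              crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search for a 0/1 feature mask that maximizes `fitness`."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: the current best mask always survives unchanged.
        children = [max(population, key=fitness)[:]]
        while len(children) < pop_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation keeps some diversity in the population.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


def toy_fitness(mask):
    """Toy stand-in: features 0-2 are useful, the rest add only cost."""
    return sum(mask[:3]) - 0.5 * sum(mask[3:])


print(ga_select(8, toy_fitness))
```

Masks with poor fitness tend to lose tournaments and disappear over the generations, while good masks are copied and recombined, which mirrors the description above.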
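
Three of the common fusion rules named above (majority voting, the mean method, and the weighted mean method) can be illustrated with a short sketch. The Ma_Ada algorithm is not described in the abstract and is not reproduced here; the function names, class labels, probabilities, and weights below are illustrative assumptions.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by the most classifiers.

    Ties are broken in favor of the label that appears first.
    """
    return Counter(labels).most_common(1)[0][0]


def mean_rule(prob_rows, weights=None):
    """Fuse per-classifier class probabilities by (weighted) averaging.

    prob_rows[i][c] is classifier i's probability for class c.
    Returns (index of the winning class, fused probabilities).
    With weights=None this is the plain mean method.
    """
    if weights is None:
        weights = [1.0] * len(prob_rows)
    total = sum(weights)
    n_classes = len(prob_rows[0])
    fused = [sum(w * row[c] for w, row in zip(weights, prob_rows)) / total
             for c in range(n_classes)]
    winner = max(range(n_classes), key=fused.__getitem__)
    return winner, fused


# Hypothetical outputs of three classifiers for one protein, two classes.
probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
print(majority_vote(["all-alpha", "all-beta", "all-alpha"]))
print(mean_rule(probs)[0])
print(mean_rule(probs, weights=[6, 1, 1])[0])
```

In each rule the classifiers are consulted independently and their outputs are only combined at the end, which is the limitation the thesis cites as motivation for its own fusion algorithm.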

Keywords/Search Tags:

protein structural class prediction, feature extraction, genetic algorithm, machine learning, multi-classifier fusion algorithm, Ma_Ada fusion algorithm

Related items

1	Low-similarity Protein Structural Class Prediction Based On Multiple Features
2	Predicting The Proteins Subcellular Localization Based On Physical And Chemical Features Fusion
3	The Classification Prediction Of High Dimensional Data Of Membrane Protein Based On Multi-feature Fusion
4	Study On Some Key Algorithms In Protein Structural Class Prediction
5	Based On Feature Fusion Protein Properties Prediction Of Multiple Points Of View
6	The Research On Prediction Of Protein Subcellular Location Using Multi-information Fusion Based On Sequence
7	Protein Structural Class Prediction Based On Feature Fusion
8	A Method And Its Application Research For Protein Subcellular Localization Prediction Based On Multi-label Learning
9	Research On Protein Structural Class Prediction Methods Based On Multi-future Information Fusion
10	A Multi-feature Fusion Algorithm For LncRNA Subcellular Localization Prediction Problem