Study On Protein Function Prediction Based On Intelligent Computation

Posted on: 2009-07-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T L Zhang
Full Text: PDF
GTID: 1100360245478046
Subject: Control theory and control engineering
Abstract/Summary:
Proteins are essential components of the cell and play a critical role in life processes. Protein function prediction, i.e., the classification of protein sequences according to their biological function, is an important task in bioinformatics and protein science. The gap between the number of known protein sequences and the number of annotated proteins is widening rapidly, so powerful tools and effective methods are needed to bridge it. Computational prediction of protein function has therefore become one of the most important research topics in the field, as has the discovery of relationships between a protein's sequence and its function. This thesis focuses on several important problems in protein function prediction: protein subcellular localization, protein structural classes, and protein secondary structure. We aim to develop approaches that predict protein function from sequence. The main contributions of the thesis are as follows.

First, we review the development of protein subcellular localization prediction and summarize its difficulties and likely further developments. Building on the concept of pseudo amino acid (PseAA) composition originally introduced by Chou, we propose an improved PseAA (IPseAA) composition in which the weight factors are optimized by an immune genetic algorithm. Based on IPseAA, a novel feature vector is developed to represent a protein sample, incorporating the concepts of average power spectral density and hydrophobicity pattern. Promising results are obtained when the method is used to predict eukaryotic protein subcellular localization. We then propose another approach for predicting the subcellular localization of apoptosis proteins: an ensemble classifier whose basic classifier is the fuzzy K-nearest neighbors (FKNN) algorithm.
Each basic classifier is trained on a collocated amino acid pair composition, where a collocated amino acid pair is a pair of amino acids taken at different spacings along the sequence. A feature selection algorithm based on a genetic algorithm is used to obtain the optimized features. The results of jackknife and independent dataset tests indicate that the proposed approach is effective and practical.

For the prediction of protein structural classes, we propose three methods. 1) A method based on the binary-tree support vector machine (BT-SVM). By combining amino acid composition, the correlation of amino acids in the sequence, and hydrophobicity patterns, a novel PseAA composition is developed to represent protein samples. BT-SVM is used as the prediction engine; it can resolve the problem of unclassifiable data points in multi-class SVMs. 2) A method based on approximate entropy (ApEn) and hydrophobicity patterns, which are used to generate the PseAA composition for protein samples, with an FKNN classifier as the prediction engine. A large and stringent dataset is adopted to validate the approach; the encouraging results indicate that the novel PseAA composition based on ApEn and hydrophobicity patterns may reflect the core features of proteins in different structural classes. 3) A two-layer fuzzy support vector machine (FSVM) network. In the first layer, the input to each basic classifier (FSVM) is the PseAA composition based on different physicochemical properties of amino acids. The outputs of the first-layer FSVMs are combined into a vector, which is the input to the FSVM in the second layer.

Natural language processing methods are introduced to handle the problem of protein secondary structure prediction. We propose an approach to predicting protein secondary structure based on the maximum entropy model.
Feature space and feature templates are designed according to the contextual information of the target residue and the structural class information of the protein sequence. All features are combined into an event and incorporated into the maximum entropy model. Separate models are trained on datasets of different structural classes. The features for protein secondary structure prediction do not use any information from multiple-sequence profiles; the aim of the study is to help improve the functional annotation of "orphan" proteins, which have no detectable homologs. Validated on benchmark datasets, the high prediction success rate suggests that the approach may become a useful tool in related areas.

There are few studies on protein subnuclear localization prediction because the nucleus is more compact and complicated than other cell compartments. We develop an approach based on an ensemble of AdaBoost classifiers. The ApEn-based PseAA composition of the sequence is used to represent the features of a protein sequence. Two benchmark datasets are used to validate the approach; compared with published work, it achieves the highest accuracy.

Protein sequences in the same family share the same function. We can therefore assume that some similarly important regions, called motifs, exist in sequences that belong to the same family. A motif discovery method is proposed in which a motif set is searched for to represent each protein family. The method has been used to identify 21 ligase subfamilies, and a web server has been released for free scientific use.

Finally, the thesis is summarized, and the shortcomings of the work and directions for further development are discussed.
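The collocated amino acid pair composition mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition, so the following assumes one common reading: for each spacing, count every ordered pair of residues separated by that many intervening positions and normalize by the number of pairs counted. The function names, the `max_gap` parameter, and the normalization are hypothetical choices for illustration, not the thesis's implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def collocated_pair_composition(sequence, gap):
    """Normalized frequencies of residue pairs with `gap` residues between them.

    gap=0 gives ordinary adjacent pairs. This definition is an assumption;
    the thesis may use a different one.
    """
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    step = gap + 1
    for i in range(len(sequence) - step):
        pair = sequence[i] + sequence[i + step]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]


def feature_vector(sequence, max_gap=2):
    """Concatenate the pair compositions for gaps 0..max_gap (hypothetical layout)."""
    vec = []
    for gap in range(max_gap + 1):
        vec.extend(collocated_pair_composition(sequence, gap))
    return vec
```

Under these assumptions, `feature_vector(seq, max_gap=2)` yields a vector of 3 × 400 = 1200 values. In an ensemble like the one described, each gap's 400-value block could be fed to its own basic classifier.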
Keywords/Search Tags:protein function prediction, protein subcellular localization, protein structural classes, protein secondary structure, fuzzy K nearest neighbor classifier, fuzzy support vector machine, ensemble classifier, Motif discovery