Prediction Of Protein Structure And Binding Site Based On Frequency Profile

Posted on: 2011-06-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Liu
Full Text: PDF
GTID: 1100330338489479
Subject: Computer application technology

Abstract/Summary:
With the development of protein sequencing technology, a growing number of protein sequences have been determined. This makes it possible to apply data-driven techniques to the prediction of protein structure and function. The protein sequence frequency profile contains a large amount of evolutionary information and is therefore a richer encoding of a protein than the individual sequence alone. Using this evolutionary information is of great significance for solving problems in bioinformatics. This thesis presents methods for protein structure and function prediction that use the evolutionary information extracted from protein sequence frequency profiles. The thesis consists of the following parts.

Firstly, we present a novel building block of proteins, called the order profile, which captures the evolutionary information in protein sequence frequency profiles, and we apply this building block to produce a class of propensities called order profile long disorder propensities. These propensities, combined with position-specific scoring matrices, are used as input to logistic regression (LR) for the prediction of long disordered regions in proteins. In a 5-fold cross-validation test, our method achieves an area under the ROC curve of 97.5%. On a blind test set, our method is significantly more accurate than several existing long disordered region predictors. Compared with the residue long disorder propensity, the order profile long disorder propensity significantly improves predictive performance, indicating that evolutionary information is important for the prediction of long disordered regions.

Secondly, in protein remote homology detection and fold recognition, discriminative methods based on the support vector machine (SVM) are the most effective and accurate. A key step in improving the performance of SVM-based methods is to find a suitable representation of protein sequences. In this thesis, a novel protein vectorization method based on Top-n-grams is presented. A Top-n-gram can be viewed as a novel building block of protein sequences that contains the evolutionary information extracted from protein sequence frequency profiles. The frequency profiles are converted into Top-n-grams by combining the n most frequent amino acids at each position. Each protein sequence is then transformed into a fixed-dimension feature vector whose entries are the numbers of occurrences of each Top-n-gram. The training vectors are used to train SVM classifiers, which are then used to classify the test protein sequences. We demonstrate that the performance of remote homology detection and fold recognition can be further improved by combining Top-n-grams with latent semantic analysis (LSA), an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results than related methods.

Thirdly, a class of novel profile-level propensities is presented, namely order profile interface propensities. For comparison, we revisit the use of residue interface propensities and binary profile interface propensities for protein binding site prediction. Each kind of propensity, combined with sequence profiles and accessible surface areas, is used as input to an SVM. When tested on four types of complexes (hetero-permanent, hetero-transient, homo-permanent and homo-transient complexes), experimental results show that order profile interface propensities perform better than residue interface propensities and binary profile interface propensities.

Fourthly, we introduce a machine learning model, the hidden Markov support vector machine (HM-SVM), for protein binding site prediction. The model treats protein binding site prediction as a sequential labelling task based on the maximum margin criterion. Common features derived from protein sequences and structures, including the protein sequence profile, residue accessible surface area and order profile interface propensity, are used to train the HM-SVM. When tested on six data sets, the HM-SVM-based method performs better than several state-of-the-art methods, including artificial neural networks, support vector machines and conditional random fields. Furthermore, its running time is several orders of magnitude shorter than that of the compared methods. When the order profile interface propensity is added to the HM-SVM as an extra feature, its performance improves significantly. The improved prediction performance and computational efficiency of the HM-SVM-based method can be attributed to three factors. Firstly, the relation between the labels of neighbouring residues is useful for protein binding site prediction. Secondly, the kernel trick is highly advantageous for this problem. Thirdly, with the cutting-plane algorithm, the complexity of the HM-SVM training step is linear in the number of training samples.
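
The Top-n-gram vectorization described in the second part can be illustrated with a minimal Python sketch. It assumes that a frequency profile is given as a list of per-position dictionaries mapping amino acids to frequencies. The function names `top_n_gram` and `vectorize`, the alphabetical tie-breaking rule, and the vocabulary of ordered n-tuples of distinct amino acids are illustrative assumptions, not details taken from the thesis. The SVM training and LSA steps are omitted.

```python
from collections import Counter
from itertools import permutations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def top_n_gram(column, n):
    """Return the n most frequent amino acids of one profile column,
    most frequent first, joined into a single token."""
    # Assumption: ties (including absent residues with frequency 0)
    # are broken alphabetically.
    ranked = sorted(AMINO_ACIDS, key=lambda aa: (-column.get(aa, 0.0), aa))
    return "".join(ranked[:n])


def vectorize(profile, n):
    """Map a frequency profile to a fixed-dimension vector of
    Top-n-gram occurrence counts."""
    # Assumption: the vocabulary is every ordered n-tuple of distinct
    # amino acids, which fixes the dimension regardless of sequence length.
    vocabulary = ["".join(p) for p in permutations(AMINO_ACIDS, n)]
    counts = Counter(top_n_gram(column, n) for column in profile)
    return [counts[gram] for gram in vocabulary]


# Hypothetical three-position profile, for illustration only.
toy_profile = [
    {"A": 0.6, "G": 0.3, "S": 0.1},
    {"L": 0.5, "I": 0.4, "V": 0.1},
    {"A": 0.7, "G": 0.2, "T": 0.1},
]
vector = vectorize(toy_profile, n=2)
```

In this toy profile the first and third positions both yield the Top-2-gram AG and the second yields LI, so the resulting vector has exactly two nonzero entries. Vectors of this kind would then serve as the input to a classifier such as an SVM.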
Keywords/Search Tags: Protein sequence frequency profile, Top-n-gram, Order profile, Protein long disordered region, Protein remote homology, Protein-protein binding site