Prediction Method Research Of Protein Function Based On Support Vector Machine

Posted on: 2013-01-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S P Shi
GTID: 1110330374464253
Subject: Micro and Nano Materials Science and Engineering
Abstract/Summary:
As human society enters the post-genomic era, the functional annotation of genes has become a focus of scientific research. According to the central dogma, the genetic information recorded in genes must be translated into proteins before it can be expressed as function, so the study of protein function has become very important. Unfortunately, determining protein function by conventional experiments is often laborious, time-consuming and expensive. Developing reliable, high-throughput computational methods for identifying protein function has therefore become a crucial issue. In view of the current state of research on protein function prediction, several new encoding methods for protein sequences are proposed to predict protein function from amino acid sequences on the basis of the support vector machine. The main contents are summarized as follows:

1. A new method was developed to identify submitochondrial and subchloroplast locations. In this study, a different formulation of pseudo amino acid composition was constructed by feature extraction with the discrete wavelet transform. Jackknife cross-validation shows that the method efficiently distinguishes mitochondrial proteins from chloroplast proteins. The predictive accuracies for submitochondrial and subchloroplast locations were 3.7% to 22.1% higher than those of other existing methods; in particular, the accuracies for the mitochondrial outer membrane and the chloroplast thylakoid lumen were greatly improved. These results indicate that the discrete wavelet transform can remove the noise components of an amino acid sequence and reflect the overall sequence-order features of a protein more effectively. Furthermore, the influence of hydrophobicity and polarity on the predictions was examined: the polar characteristic tended to be greater for the mitochondrial outer membrane, whereas the hydrophobic characteristic was more prominent than the polar characteristic for the chloroplast thylakoid lumen.

2. A new model called PMeS was constructed to predict methylarginine and methyllysine sites. The feature encoding scheme combines encoding based on amino acid properties (EBP), position weight amino acid composition (PWAA) and solvent accessible surface area (ASA). PWAA is proposed to represent sequence-order information around methylation sites. EBP and ASA are used to characterize protein sequence information, the physicochemical properties of amino acids and the structural characteristics surrounding methylation sites. The results of 10-fold cross-validation show that the PMeS algorithm is effective in identifying methylation status. Feature selection, the effect of window length, the ratio of positive to negative samples and the robustness of PMeS were also investigated in depth. The results of the different cross-validations and of an independent test indicate that PMeS is stable and significantly better than other prediction tools. The algorithm has been implemented as an online service (http://bioinfo.ncu.edu.cn/inquiries_PMeS.aspx).

3. A novel approach called PLMLA, which incorporates protein sequence information, secondary structure and amino acid properties, was introduced to predict methylation and acetylation of lysine residues in whole protein sequences. An encoding scheme based on grouped weight and position weight amino acid compositions was applied to extract sequence information and physicochemical properties around lysine sites. The differences among methyllysine, acetyllysine and lysine that is neither methylated nor acetylated were discussed in detail in terms of position-specific properties, physicochemical properties and secondary structure. The performance of models trained with various features reveals that a model with multiple features can make full use of the complementary information among different features to improve classification performance. On the independent test, the predictive accuracy of PLMLA for methyllysine was about 30.3% and 37.88% higher than that of BPB-PPMS and MASA, respectively. For acetyllysine, the predictive accuracy of PLMLA was 33.33% and 36.11% higher than that of LysAcet and N-Ace, respectively. These results indicate that PLMLA significantly improves on the current state of methyllysine and acetyllysine prediction and is an effective tool for identifying methylation and acetylation of lysine residues in whole protein sequences. A user-friendly online service is available at http://bioinfo.ncu.edu.cn/inquiries_PLMLA.aspx.

4. A new method based on information entropy, amino acid properties and structural characteristics was developed to predict tyrosine nitration sites. A preliminary comparison was made between a window selected by information entropy and the traditional continuous window. The results show that the information entropy window effectively captures the important sites in the nitrotyrosine peptide. It resolves the conflict that a short peptide easily loses information while simply lengthening the peptide introduces redundant information, and it ultimately improves prediction performance. Feature analysis reveals that the local electrostatic environment of tyrosine residues, adjacent evolutionarily conserved sites and long-range sites have a significant influence on tyrosine nitration. The detailed analysis of the results in this work may help in understanding the tyrosine nitration mechanism and guide related experimental validation.
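To make the wavelet-based encoding in item 1 more concrete, the following is a minimal sketch of how a protein sequence might be turned into a fixed-length feature vector with a discrete wavelet transform. The abstract does not state the wavelet family, the decomposition depth, the physicochemical scale or the summary statistics, so everything here is an assumption chosen for illustration: a Haar wavelet, three levels, the Kyte-Doolittle hydropathy scale, and the hypothetical helper names `haar_step`, `summary` and `dwt_features`.

```python
import math

# Kyte-Doolittle hydropathy index. Using this particular scale is an
# assumption; the abstract only says hydrophobicity and polarity were studied.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def haar_step(signal):
    """One level of the Haar DWT: returns (approximation, detail)."""
    values = list(signal)
    if len(values) % 2:                 # pad odd lengths by repeating the last value
        values.append(values[-1])
    root2 = math.sqrt(2.0)
    approx = [(values[i] + values[i + 1]) / root2 for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / root2 for i in range(0, len(values), 2)]
    return approx, detail


def summary(values):
    """Mean, standard deviation, minimum and maximum of a coefficient list."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [mean, std, min(values), max(values)]


def dwt_features(sequence, levels=3):
    """Map a sequence to 4 * (levels + 1) wavelet statistics (illustrative only)."""
    # Residues outside the 20 standard amino acids are skipped.
    signal = [KD[aa] for aa in sequence.upper() if aa in KD]
    if len(signal) < 2 ** levels:
        raise ValueError("sequence too short for the requested number of levels")
    features = []
    for _ in range(levels):
        signal, detail = haar_step(signal)
        features.extend(summary(detail))    # statistics of each detail band
    features.extend(summary(signal))        # statistics of the final approximation
    return features
```

Because the statistics are computed per decomposition level, sequences of different lengths map to vectors of the same size (16 values for `levels=3`), which is the form a support vector machine expects as input.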
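The position weight amino acid composition (PWAA) mentioned in item 2 can be illustrated with a short sketch. The abstract does not give the formula, so the weighting below is an assumption based on the form in which PWAA is usually written: for a peptide with L residues on each side of the central site, residue type i receives the score C_i = (1 / (L(L + 1))) * sum over j of x_ij * (j + |j| / L), where j runs from -L to L and x_ij is 1 when residue i occupies position j. The function name `pwaa` and the residue ordering are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # output ordering is an arbitrary choice


def pwaa(peptide):
    """Return a 20-dimensional PWAA vector for an odd-length peptide
    whose middle residue is the candidate modification site."""
    if len(peptide) < 3 or len(peptide) % 2 == 0:
        raise ValueError("peptide must have odd length of at least 3")
    half = len(peptide) // 2                      # L in the formula above
    scores = dict.fromkeys(AMINO_ACIDS, 0.0)
    for offset, residue in zip(range(-half, half + 1), peptide.upper()):
        if residue in scores:                     # gap or unknown symbols are ignored
            scores[residue] += offset + abs(offset) / half
    norm = half * (half + 1)
    return [scores[aa] / norm for aa in AMINO_ACIDS]
```

Unlike a plain amino acid composition, the score depends on where each residue sits relative to the central site, which is how an encoding of this kind carries sequence-order information.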
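The information entropy window in item 4 can be sketched in a similar way. The abstract does not say how positions are chosen, so this example assumes one plausible criterion: compute the Shannon entropy of each position across aligned, equal-length peptides centred on tyrosine, and keep the k non-central positions with the lowest entropy (the most conserved ones). The names `position_entropy` and `entropy_window` are hypothetical.

```python
import math
from collections import Counter


def position_entropy(peptides):
    """Shannon entropy (in bits) of the residue distribution at each position."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    entropies = []
    for column in zip(*peptides):
        total = len(column)
        counts = Counter(column)
        entropies.append(
            -sum((c / total) * math.log2(c / total) for c in counts.values())
        )
    return entropies


def entropy_window(peptides, k):
    """Indices of the k lowest-entropy positions, excluding the central site.

    Choosing the lowest-entropy positions is an assumption made for this sketch.
    """
    entropies = position_entropy(peptides)
    center = len(entropies) // 2
    candidates = [i for i in range(len(entropies)) if i != center]
    ranked = sorted(candidates, key=lambda i: entropies[i])
    return sorted(ranked[:k])
```

The selected indices need not be contiguous, so a window built this way can include informative long-range positions without carrying every residue in between, in contrast to a traditional continuous window.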
Keywords/Search Tags: support vector machine, discrete wavelet transform, mitochondria, chloroplast, protein post-translational modification, encoding based on amino acid properties, position weight amino acid composition, solvent accessible surface area