Predicting Protein-Protein Interactions Based On Support Vector Machine And Complete Protein Sequence

Posted on: 2011-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
GTID: 2120330332958028
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
With the completion of the Human Genome Project, life science has entered the post-genomic era, and proteomics has become increasingly important. Because proteins are the primary components of life activities, protein-protein interaction (PPI) prediction is a hot topic in proteomics. Traditional experimental methods for finding PPIs are time-consuming, labor-intensive and expensive, and their results have high false-positive and false-negative rates. It is therefore important to develop efficient information analysis technologies that can uncover the internal links within large numbers of proteins and complex datasets, and many researchers now use bioinformatics tools to study PPIs.

In this thesis, the PPI data were downloaded from the Saccharomyces cerevisiae core subset of the Database of Interacting Proteins (DIP), giving an original positive dataset of 5943 pairs of interacting proteins. We aligned the protein sequences with a multiple sequence alignment tool and removed the protein pairs with more than 40 percent sequence identity, which yielded a non-redundant dataset of 5594 pairs of interacting proteins. Because non-interacting protein pairs were not readily available, we used three strategies to construct negative datasets. The first strategy randomly paired proteins that appeared in the positive dataset; the resulting negative dataset is called Prcp.
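The random-pairing strategy can be illustrated with a minimal Python sketch. It is not the thesis's actual procedure: the function name `random_negative_pairs`, the exclusion of known positive pairs and self-pairs, and the fixed seed are all assumptions made for illustration.

```python
import random

def random_negative_pairs(positive_pairs, n_pairs, seed=0):
    """Sample unordered protein pairs that are absent from the positive set.

    positive_pairs: iterable of (protein_a, protein_b) identifiers.
    Assumptions (not from the thesis): pair order is ignored, self-pairs
    are never produced, and known positives are rejected.
    """
    positive_pairs = list(positive_pairs)
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in positive_pairs}
    proteins = sorted({p for pair in positive_pairs for p in pair})

    # Number of distinct non-self pairs that are not known positives.
    n = len(proteins)
    available = n * (n - 1) // 2 - sum(1 for k in known if len(k) == 2)
    if n_pairs > available:
        raise ValueError("not enough candidate pairs to sample from")

    negatives = set()
    while len(negatives) < n_pairs:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)
```

For example, `random_negative_pairs([("A", "B"), ("C", "D")], 3)` returns three pairs drawn from proteins A to D, none of which is A-B or C-D. Randomly paired proteins are only presumed not to interact, so a set built this way may still contain undiscovered true interactions.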
The second strategy assumed that proteins occupying different subcellular localizations do not interact; the resulting negative dataset is called Psub. The third strategy used the Shufflet program to shuffle the right-side sequences of the interacting pairs with k-let (k = 1, 2, 3). Together, these strategies gave five negative datasets. The protein pairs with more than 40 percent sequence identity were removed from the Psub negative dataset, yielding a non-redundant negative dataset of 5594 pairs of non-interacting proteins.

We represented amino acids with two encoding methods: the five encoding of the amino acids and seven physicochemical properties of the amino acids. We used the support vector machine (SVM) to build models and predict PPIs. The five encoding method gave higher prediction accuracy than the seven physicochemical properties: 95.50% for the 1-let dataset, 92.12% for the Psub dataset, and 90.84% for the non-redundant dataset. The five-fold cross-validation accuracy for the 1-let and Psub datasets was 93.26% and 90.31%, respectively. At the end of the thesis, we summarize our work, discuss the disadvantages of our method, and give an outlook on future proteomics research.
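The k-let shuffling strategy mentioned in the abstract can be sketched for the simplest case, k = 1, where shuffling only has to preserve the amino acid composition of a sequence. The code below is a hedged illustration and not the Shufflet program itself: the helper names `shuffle_1let` and `shuffled_negative_pairs`, and the representation of a pair as two sequence strings, are assumptions. Shuffling with k = 2 or k = 3 must also preserve the counts of overlapping k-residue words, which a plain permutation does not do.

```python
import random

def shuffle_1let(sequence, rng):
    """Return a random permutation of the residues in `sequence`.

    A 1-let shuffle keeps the count of every residue unchanged and
    destroys only the residue order.
    """
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)

def shuffled_negative_pairs(positive_pairs, seed=0):
    """Build negative pairs by shuffling the right-side sequence of each pair.

    positive_pairs: iterable of (left_sequence, right_sequence) strings.
    The left-side sequence is kept as-is (an assumption of this sketch).
    """
    rng = random.Random(seed)
    return [(left, shuffle_1let(right, rng)) for left, right in positive_pairs]
```

Each negative pair produced this way has the same length and composition as its positive counterpart, so a classifier cannot separate the two classes using composition alone.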
Keywords/Search Tags: proteomics, protein-protein interaction prediction, bioinformatics, support vector machine