
Prediction Of Protein Subcellular Location Using Improved KNN Classification Algorithm Based On Similarity Comparison

Posted on: 2017-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: X F Wang
GTID: 2310330518480060
Subject: Computer Science and Technology
Abstract/Summary:
The function of a protein is closely related to its subcellular location. Predicting that location therefore helps reveal a protein's function and is significant for biological research. Traditionally, subcellular location information has been obtained by experiment, which is time-consuming and costly and is poorly suited to locating protein sequences. An efficient method for predicting protein subcellular location is therefore needed. In this thesis, we introduce protein sequence feature extraction algorithms, improve the traditional KNN classifier, and propose an improved KNN classification algorithm for protein subcellular location based on similarity comparison of sequence alignments; we also apply ensemble prediction with AdaBoost and Bagging. The main work is as follows. We introduce three feature extraction algorithms: amino acid composition, dipeptide composition, and pseudo amino acid composition. In addition to the public data sets CH317 and ZD98, we construct a new data set, Gram1253. We improve the traditional KNN classifier and use BLAST comparison to make the final KNN decision. With the new classification algorithm based on similarity comparison, tests on the three data sets give success rates of 93.9%, 91.5%, and 92.5%, respectively. The Hadoop distributed computing framework is applied to optimize the algorithm. To study the prediction algorithm further, we use AdaBoost and Bagging ensembles of multiple KNN classifiers to predict the subcellular location of sequences; in jackknife tests on the three data sets, AdaBoost achieves success rates of 94.9%, 92.4%, and 93.1%, respectively. Because the ZD98 and CH317 data sets are unevenly distributed, the Bagging ensemble is less accurate on them than the KNN algorithm, with accuracies of 89.8% and 87.7%. On Gram1253, however, Bagging performs well, with a prediction accuracy of 92.9%. The experimental results show that AdaBoost and Bagging ensemble classification is an effective method for predicting protein subcellular location.
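To make the basic pipeline concrete, the following is a minimal Python sketch of amino acid composition features, a plain KNN vote, and a jackknife (leave-one-out) test. It is an illustration under stated assumptions, not the thesis's method: it uses the traditional KNN with Euclidean distance only, omits the BLAST-based decision step and the AdaBoost and Bagging ensembles, and all function names are hypothetical.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(seq):
    """Return the 20-dimensional amino acid composition (residue frequencies)."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors nearest to `query`.

    `train` is a list of (feature_vector, label) pairs. Euclidean distance
    is an assumption here; the thesis's improved classifier differs.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def jackknife_accuracy(samples, k=3):
    """Leave-one-out test: predict each sample from all the others."""
    correct = 0
    for i, (vec, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if knn_predict(rest, vec, k) == label:
            correct += 1
    return correct / len(samples)
```

To use it, convert each labelled sequence with `aa_composition`, collect the `(vector, label)` pairs in a list, and pass that list to `jackknife_accuracy`. The jackknife test is attractive for small data sets because every sample serves as a test case exactly once.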
Keywords/Search Tags: Subcellular locations, Protein sequence characteristics, KNN, BLAST, AdaBoost, Bagging