Predicting Protein-Nucleotide Binding Sites Using Improved AdaBoost and KNN Algorithms

Posted on: 2016-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: X Xin
Full Text: PDF
GTID: 2180330482954845
Subject: Computer technology

Abstract/Summary:
With the continuing development of computer and network technology, humanity has entered the era of big data, and computer science now has a profound influence on biology, as it does on other disciplines. With the arrival of the post-genome era, protein sequencing technology has advanced rapidly, producing explosive growth in protein sequence data. Compared with primary sequences, however, information about protein structure and function matters more, because understanding it greatly advances biology, the life sciences, pharmaceutical engineering, and related fields. Many researchers have therefore devoted themselves to studying the spatial structure and function of proteins. Early experimental methods from biology, with their high cost in time and money, can no longer fully meet this demand, and bioinformatics emerged in response. Researchers began using computational methods to predict protein structure and function and have achieved encouraging results.

Proteins in the body do not exist in isolation; they carry out specific functions by interacting with other molecules. Molecules that interact with proteins are collectively called ligands, and nucleotides are an important class of ligand with distinctive characteristics. Understanding the mechanism of protein-nucleotide interaction is clearly important for understanding protein function, so identifying protein-nucleotide interaction sites has become a very active research topic in recent years.

The k-nearest neighbor (KNN) classifier is a long-established and practical method. It is robust and stable, and it is widely used in machine learning and data mining. Its basic idea is to find, among the training samples, the K samples "closest" to the sample under test, and to classify the test sample according to the distribution of classes among those K samples.
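The KNN idea described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the Euclidean distance, the function name `knn_predict`, and the toy two-dimensional features are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    # Assumption: "closest" means smallest Euclidean distance.
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Made-up 2-D features; 1 = binding residue, 0 = non-binding residue.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]

prediction = knn_predict(train_X, train_y, (0.95, 1.0), k=3)
```

With an odd `k` and two classes the vote cannot tie, which is why `k=3` is used here.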
In biology, research has shown that the more similar two protein sequences are, the more likely they are to have similar structure and function. The simple and intuitive KNN method has therefore achieved quite competitive results in protein-nucleotide binding site prediction. The KNN algorithm still has a serious shortcoming, however: its prediction performance drops significantly when the samples are unevenly distributed across classes. Protein-nucleotide binding site data suffer from severe class imbalance ("sample tilt"), with a huge difference between the numbers of positive and negative samples. To address this problem, this thesis proposes the A-KNN algorithm. Based on the AdaBoost algorithm, A-KNN undersamples the training set to form N weak training sets, builds a weak classifier on each weak training set using an improved KNN algorithm, and then integrates the N weak classifiers into a strong classifier that produces the final prediction.

Experimental results show that A-KNN significantly improves on the original KNN algorithm in both accuracy and the Matthews correlation coefficient (MCC). When noise is artificially added to the data, the algorithm reduces the effect of the noisy data on the classification results. In comparison with other well-performing algorithms, A-KNN also improves on both accuracy and MCC. Extensive testing confirms that the method can effectively improve the prediction of protein-nucleotide binding sites.
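A simplified sketch of the undersample-then-ensemble idea follows. The abstract does not give the AdaBoost weighting or the details of the improved KNN, so this sketch swaps in plain random undersampling, ordinary KNN, and an unweighted majority vote. All function names and the toy data are hypothetical. The `mcc` function uses the standard definition of the Matthews correlation coefficient.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k samples in `train` nearest to `query`."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def balanced_subsets(samples, n_subsets, rng):
    """Assumption: each weak training set keeps every positive sample plus
    an equally sized random draw of negatives (random undersampling)."""
    pos = [s for s in samples if s[1] == 1]
    neg = [s for s in samples if s[1] == 0]
    return [pos + rng.sample(neg, len(pos)) for _ in range(n_subsets)]

def ensemble_predict(subsets, query, k=3):
    """Assumption: unweighted majority vote over the weak KNN classifiers
    (the thesis's AdaBoost-based combination is not specified here)."""
    votes = Counter(knn_predict(subset, query, k) for subset in subsets)
    return votes.most_common(1)[0][0]

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up imbalanced data: 3 positives near (1, 1), 30 negatives near (0, 0).
rng = random.Random(0)
positives = [((1.0, 1.0), 1), ((0.9, 1.1), 1), ((1.1, 0.9), 1)]
negatives = [((rng.uniform(0, 0.5), rng.uniform(0, 0.5)), 0) for _ in range(30)]
subsets = balanced_subsets(positives + negatives, n_subsets=5, rng=rng)

label_near_positives = ensemble_predict(subsets, (1.0, 0.95))
label_near_negatives = ensemble_predict(subsets, (0.1, 0.1))
```

An odd number of subsets avoids ties in the final vote. MCC is reported alongside accuracy because accuracy alone can look high on imbalanced data even when the minority class is never predicted.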
Keywords/Search Tags: Protein, Nucleotides, AdaBoost, KNN, Sample tilt