DNA-binding Protein Identification And Remote Homology Detection Based On Sequence-order Information

Posted on: 2015-03-13
Degree: Master
Type: Thesis
Country: China
Candidate: J H Xu
GTID: 2180330479489767
Subject: Computer Science and Technology
Abstract/Summary:
With advances in biological sequencing technology, protein sequence data is growing explosively, while data on protein structure and function grows slowly. It is therefore necessary to predict the function and spatial structure of proteins from their primary sequences. This thesis studies two important tasks in protein structure and function prediction: DNA-binding protein identification and protein remote homology detection. One aim is to explore algorithms for extracting and using protein sequence-order information, and to incorporate this information into prediction models so as to improve the predictive performance of computational methods. Both tasks are addressed with machine learning techniques, natural language processing methods, and features based on protein sequence-order information. The specific research content is as follows.
Firstly, DNA-binding protein identification is an important problem in protein function prediction. To address it, we proposed two methods, PseDNA-Pro and iDNA-Prot|dis. This thesis applied PseAAC (Pseudo Amino Acid Composition) to this area; to the best of our knowledge, it is the first attempt to use the concept of PseAAC for DNA-binding protein identification. To address the disadvantages of PseAAC, we proposed a new computational method called PseDNA-Pro. Besides PseAAC, two further feature extraction methods, OAAC (Overall Amino Acid Composition) and PDT (Physicochemical Distance Transformation), were incorporated into PseDNA-Pro. The final feature vectors built from the three feature extraction methods were combined with an SVM (Support Vector Machine) to build PseDNA-Pro for DNA-binding protein identification.
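
As an illustration of the PseAAC idea, the following is a minimal sketch of the commonly published type-1 formulation: 20 amino acid frequencies plus a few sequence-order correlation factors. It is not the thesis's implementation. The use of a single property (Kyte-Doolittle hydropathy), the function name `pse_aac`, and the defaults `lam=3` and `weight=0.05` are assumptions made for the example; OAAC and PDT are not shown.

```python
# Kyte-Doolittle hydropathy values, used here as an ASSUMED single
# physicochemical property; the thesis may use different properties.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(HYDROPATHY)


def _standardize(prop):
    """Rescale property values to zero mean and unit variance over the 20 residues."""
    values = list(prop.values())
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {aa: (v - mean) / std for aa, v in prop.items()}


def pse_aac(sequence, lam=3, weight=0.05):
    """Return a (20 + lam)-dimensional PseAAC-style vector (hypothetical helper)."""
    # Residues outside the 20 standard amino acids are skipped.
    seq = [aa for aa in sequence.upper() if aa in HYDROPATHY]
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    h = _standardize(HYDROPATHY)
    # Conventional amino acid composition (order-free part).
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    # Sequence-order correlation factors for gaps k = 1..lam.
    thetas = []
    for k in range(1, lam + 1):
        diffs = [(h[seq[i + k]] - h[seq[i]]) ** 2 for i in range(len(seq) - k)]
        thetas.append(sum(diffs) / len(diffs))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

Calling `pse_aac("MKVLAAGIVGLCAKRW")` yields a 23-dimensional vector whose entries sum to 1; vectors of this kind could then be passed to a classifier such as an SVM.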
Experimental results on two benchmark datasets showed that PseDNA-Pro achieved accuracies of 80.05% and 83.33%, respectively, outperforming the other compared methods. Although PseDNA-Pro performed better than other approaches, it ignored amino acid pairs at different distances. To make use of long-range sequence-order effects, we proposed another computational method called iDNA-Prot|dis, which is based on amino acid distance pairs. To improve its performance and reduce its computational cost, iDNA-Prot|dis was further improved by using a reduced alphabet strategy that groups amino acids with similar properties, which significantly reduced the length of the feature vectors. Experimental results on different benchmark datasets showed that iDNA-Prot|dis is an efficient computational method in this field. By analyzing the feature weights in the trained SVM model of this method, we found that it can extract features reflecting the characteristics of DNA-binding proteins.
Secondly, protein remote homology detection is a basic step in the study of protein structure and function. We proposed two new computational methods for this task, SVM-DR(DT) and disPseAAC. The iDNA-Prot|dis method showed that amino acid distance pairs can effectively incorporate sequence-order information. We applied this approach to remote homology detection and proposed a method called SVM-DR(DT), whose feature vectors are constructed from distance pairs. These feature vectors were then input into SVM classifiers to detect protein remote homology. Using Top-n-grams further improved the predictive results of this method. Experimental results showed that the profile-based method SVM-DT and the sequence-based method SVM-DR achieved ROC scores of 0.948 and 0.919, respectively.
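
The distance-pair idea with a reduced alphabet can be sketched as follows. This is a simplified illustration under stated assumptions, not the feature definition used in iDNA-Prot|dis or SVM-DR: the six-group alphabet in `GROUPS` is a hypothetical grouping chosen for the example, and the function name `distance_pair_features`, the normalization, and the default `max_distance=3` are likewise assumptions.

```python
from itertools import product

# HYPOTHETICAL reduced alphabet: residues grouped by broadly similar
# properties. The grouping used in the thesis is not given in the abstract.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
AA_TO_GROUP = {aa: idx for idx, group in enumerate(GROUPS) for aa in group}


def distance_pair_features(sequence, max_distance=3):
    """Frequencies of single groups and of group pairs separated by 1..max_distance."""
    # Map each residue to its group index; unknown symbols are skipped.
    symbols = [AA_TO_GROUP[aa] for aa in sequence.upper() if aa in AA_TO_GROUP]
    length = len(symbols)
    if length == 0:
        raise ValueError("sequence contains no standard amino acids")
    n = len(GROUPS)
    # Single-group frequencies (the order-free part).
    features = [symbols.count(g) / length for g in range(n)]
    pairs = list(product(range(n), repeat=2))
    for d in range(1, max_distance + 1):
        counts = dict.fromkeys(pairs, 0)
        # Count every pair of positions (i, i + d) in the sequence.
        for i in range(length - d):
            counts[(symbols[i], symbols[i + d])] += 1
        total = max(length - d, 1)
        features.extend(counts[pair] / total for pair in pairs)
    return features
```

With six groups and `max_distance=3` the vector has 6 + 36 × 3 = 114 entries, compared with 20 + 400 × 3 = 1220 for the full 20-letter alphabet, which shows how grouping shortens the feature vectors.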
By analyzing the trained SVM model, we found that distance pairs with shorter distance values were more important than those with longer distance values, which is consistent with the characteristics of protein families. PseAAC uses amino acid physicochemical properties, and distance pairs contain sequence-order information over long distances. To combine their advantages in one computational method, a novel method called disPseAAC (distance-pair Pseudo Amino Acid Composition) was proposed. disPseAAC not only contains sequence-order information but also takes different physicochemical properties into consideration. Principal component analysis (PCA) was performed on the feature vectors of disPseAAC to obtain a more compact representation, which reduces the dimensions of the feature vectors and removes noise. Experimental results showed that disPseAAC outperformed SVM-DR, PseAACIndex, and some other state-of-the-art methods in this field.
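
The PCA step can be illustrated with a small, dependency-free sketch that projects feature vectors onto their leading principal components using power iteration with deflation. It is an assumption-laden illustration, not the thesis's procedure: the function name `pca_project`, the iteration count, and the choice of power iteration are all made for the example, and practical work would more likely rely on a numerical library.

```python
def pca_project(rows, n_components=2, iterations=500):
    """Project rows onto the top principal components (hypothetical helper).

    rows: list of equal-length feature vectors (at least two rows).
    Returns (projected_rows, components).
    """
    n, dim = len(rows), len(rows[0])
    # Center each feature at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix of the features.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    components = []
    for _ in range(n_components):
        # Deterministic, normalized starting vector for power iteration.
        vec = [1.0 / (j + 1) for j in range(dim)]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
        for _ in range(iterations):
            nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
            norm = sum(v * v for v in nxt) ** 0.5
            if norm < 1e-12:  # no variance left in the deflated matrix
                break
            vec = [v / norm for v in nxt]
        # Rayleigh quotient gives the eigenvalue for this direction.
        eigenvalue = sum(
            vec[a] * sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)
        )
        components.append(vec)
        # Deflate: remove the variance explained by this component.
        cov = [[cov[a][b] - eigenvalue * vec[a] * vec[b] for b in range(dim)]
               for a in range(dim)]
    projected = [[sum(c[j] * comp[j] for j in range(dim)) for comp in components]
                 for c in centered]
    return projected, components
```

The projected rows are shorter than the original vectors and discard low-variance directions, which is the sense in which PCA reduces dimensionality and noise; the reduced vectors would then be passed to the classifier.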
Keywords/Search Tags: DNA-binding protein identification, protein remote homology detection, support vector machine, pseudo amino acid composition, amino acid pair distances