
Study On The New Methods For Protein-nucleic Acid Interaction Sites Prediction

Posted on: 2016-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: S P Wang
GTID: 2191330461471229
Subject: Analytical Chemistry
Abstract/Summary:
The interaction between proteins and nucleic acids plays an important role in maintaining and promoting life processes in the cell. Studying the mechanism of protein-nucleic acid interaction therefore has theoretical value and helps in understanding cellular activities such as metabolism, differentiation, reproduction, aging and signal transduction. Nucleic acid-binding sites are indispensable to this interaction, and recognizing them in proteins gives a deeper understanding of the modes and details of the interaction at the residue and atomic levels. Traditional experimental techniques identify binding sites with high accuracy, but they are complex and time-consuming, so other methods for recognizing nucleic acid-binding sites are needed. Each residue in a protein can be encoded as a feature vector built from sequence and structural features. Machine learning methods then use these feature vectors to construct models that predict nucleic acid-binding sites in proteins. The details of our work are as follows:

In the first chapter of the thesis, we introduce the basic knowledge of protein-nucleic acid interaction and its biological functions. Next, existing RNA- and DNA-binding site prediction methods are summarized. Popular methods for developing prediction models include Support Vector Machine, Artificial Neural Network, Naive Bayes and Random Forest. These models are trained on a variety of features to achieve good performance. Finally, the problems of existing methods are discussed, such as low generalization ability, imbalanced data sets and poor prediction performance.
To solve the above-mentioned problems, we propose several specific methods, and our results show that they address these problems well. The second and third chapters describe our prediction methods in detail.

In the second chapter, we propose a new method for RNA-binding site prediction. The feature vectors in the model are composed of three types of structural features and two kinds of sequence features. To expand the dimensionality of the features and the information they carry about binding sites, sliding window and smoothing window methods are applied when encoding the feature vectors. Data imbalance and low prediction performance are addressed by synthesizing positive samples and using an ensemble method. The prediction results on the testing sets show that our models predict RNA-binding sites well. Two techniques are used to select important features from the raw feature vectors, and the selected features are analyzed by number and category. The PSSM features play an essential role in RNA-binding site prediction. Finally, we compare our random forest (RF)-based method with other existing prediction methods; the results show that our method achieves higher prediction performance on the testing sets.

In the third chapter, we construct RF-based prediction models to predict DNA-binding sites. Five kinds of sequence and structural features are used as input: the composition of protein sequences, physicochemical properties, predicted secondary structures, accessible surface area and B factors. Our prediction models are developed with SMOTE and ensemble methods. We further select 150 important features by computing their information gain on the raw feature vectors. The selected features are then used to filter the training and testing sets, forming new datasets. The prediction results on these new datasets indicate that the selected features are effective for DNA-binding site prediction.
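As a rough illustration of two techniques named above, the following minimal Python sketch shows a sliding-window encoding of per-residue features and a SMOTE-style interpolation that synthesizes extra positive (binding-site) samples. It is not the thesis implementation. The function names, the window size, the zero padding at chain ends, the neighbour count `k` and the toy data are all assumptions made for this example, and the smoothing window and ensemble steps are not shown.

```python
import random


def sliding_window_features(per_residue, window=5):
    """Concatenate each residue's features with those of its neighbours.

    per_residue: list of equal-length feature lists, one per residue.
    Positions beyond the chain ends are zero-padded (an assumption of
    this sketch, not a detail taken from the thesis).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    pad = [0.0] * len(per_residue[0])
    encoded = []
    for i in range(len(per_residue)):
        row = []
        for j in range(i - half, i + half + 1):
            row.extend(per_residue[j] if 0 <= j < len(per_residue) else pad)
        encoded.append(row)
    return encoded


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core idea
    of SMOTE). Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours by squared Euclidean distance, excluding base.
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([a + gap * (b - a) for a, b in zip(base, mate)])
    return synthetic


# Toy data: 3 residues with 2 made-up features each.
toy = [[0.1, 1.0], [0.4, 0.0], [0.9, 1.0]]
windows = sliding_window_features(toy, window=3)
print(len(windows[0]))  # 6 values per residue: 3 positions x 2 features

positives = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
extra = smote_like(positives, n_new=4, k=2)
print(len(extra))  # 4 synthetic positive samples
```

In a full pipeline, the windowed vectors of binding residues would form the minority class passed to the oversampling step, and the balanced set would then be used to train the classifiers.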
Keywords/Search Tags: Protein-nucleic acid interaction, Nucleic acid-binding sites prediction, Sequence and structural features in proteins, SMOTE method, Ensemble method