
The Vectors For Describing The Similarity Between Protein Sequences And Its Effect For Recognizing DNA Binding Proteins

Posted on: 2015-09-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y P Zhang
GTID: 1220330467465547
Subject: Bioinformatics
Abstract/Summary:
The rapid growth of biological molecular data has given rise to a new interdisciplinary field, bioinformatics, which analyzes these data to reveal information of value to humans. A central focus of bioinformatics is the study of the structure and function of biological molecules through the analysis of their sequences. In this dissertation, we propose several computational approaches for studying the structure and function of proteins based on their sequences. The main results consist of three parts.

In the second chapter, we address the prediction of DNA-binding proteins, functional proteins that play an important role in various essential biological activities in the cell. We use comprehensive feature information from the protein sequence, mainly sequence length, amino acid composition, evolutionary information, secondary structure information, physicochemical properties, and functional information, and transform each protein sequence into a feature vector. Because some of the generated features may be irrelevant to predicting DNA-binding proteins, or correlated and redundant with one another, we apply different methods to extract features. With the selected feature vectors as the input of an SVM, our method achieves an accuracy of 85.31% under five-fold cross-validation, which is higher than that of the DNA-Binder, DNA-prot, and DNABIND methods on the same test dataset, DNAiset. Our method also shows a marked improvement over the other methods on the realistic test dataset DNAiset. These results suggest that our method achieves better accuracy in predicting DNA-binding proteins.

In the third chapter, we turn to sequence comparison. Alignment methods are among the most important methods in bioinformatics research, but they have high computational complexity, which makes them very difficult to apply to long sequences, multiple sequence alignment, and searches of huge databases. For this reason, many researchers study alignment-free methods. We use the occurrence frequencies of the 20 amino acids and take, as pseudo amino acid components, the numerical vectors of a graphical representation based on three physicochemical property indexes, so that a protein sequence is transformed into a 23-dimensional feature vector. A similarity analysis of nine species illustrates the effectiveness and rationality of our method. A correlation analysis compares both our results and the results based on other graphical representations with the Clustal W results; it shows that our method is superior to the other methods and contains more biological information. Furthermore, we use two new methods to obtain the numerical characteristics of the graphical representation of a protein sequence, and we take the feature vectors generated with the pseudo amino acid composition method as the input of KNN and SVM to predict DNA-binding proteins. Our method has low computational complexity and achieves an accuracy of 86%. These results show that our method is effective for analyzing the similarity of protein sequences and for predicting DNA-binding proteins.

In the fourth chapter, we use a computational method to analyze the conserved, surface-exposed C-terminal 28 amino acid residues of the H7N9 NA protein. We determine the conservation of amino acid segments from the variability within a sliding window, and generate the curve of average solvent accessibility over the same sliding window. Segment conservation correlates better with average solvent accessibility than single amino acid position variability does, and on this basis we determine that the C-terminal 28 amino acid residues are conserved and located on the surface of the protein. In addition, the conservation of the 3'-terminal RNA sequence and the crystal structure of the C-terminus confirm that these 28 residues are conserved and surface-exposed. Therefore, in the design of inhibitors of influenza virus H7N9, the conserved region can probably be used as a binding site.
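
The sliding-window conservation analysis described for the fourth chapter can be illustrated with a minimal sketch. The abstract does not state which variability measure or window size was used, so the per-column Shannon entropy, the function names, and the toy alignment below are all assumptions made for illustration; they are not the author's implementation and the sequences are not H7N9 data.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column; 0.0 means fully conserved.

    Entropy is an assumed variability measure, not one named in the abstract.
    """
    total = len(column)
    counts = Counter(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def sliding_mean(values, window):
    """Mean of every run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def window_variability(aligned_seqs, window):
    """Average per-column entropy over each sliding window of an alignment."""
    columns = list(zip(*aligned_seqs))  # transpose rows into columns
    entropies = [column_entropy(col) for col in columns]
    return sliding_mean(entropies, window)


# Hypothetical toy alignment of equal-length sequences.
toy = ["MKTAYL", "MRTAYL", "MKSAYL", "MQTAYL"]
profile = window_variability(toy, window=3)

# Start index of the window with the lowest average variability.
most_conserved_start = min(range(len(profile)), key=profile.__getitem__)
```

The same `sliding_mean` helper could be applied to a list of per-residue solvent accessibility values, which would give the second curve over identical windows so that the two profiles can be compared.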
Keywords/Search Tags: predict the DNA-binding proteins, graphical representation method, pseudo amino acid composition, conserved region of NA protein