Protein Sequence Comparison And DNA-binding Protein Identification With Generalized PseAAC And Graphical Representation

Posted on: 2021-02-15  Degree: Master  Type: Thesis
Country: China  Candidate: J L Zhao  Full Text: PDF
GTID: 2370330623475212  Subject: Applied Mathematics
Abstract/Summary:
With the development of biological technology and the deepening of genomics and proteomics research, the amount of protein sequence data has grown rapidly. Over the past few decades, experimental techniques for determining protein structures have made great progress, but they still struggle to keep pace with the explosive growth of sequence information. However, as Anfinsen found, the amino acid sequence of a protein contains enough information to determine its native conformation. Developing efficient computational approaches for decoding protein sequences in a timely manner and extracting the useful information hidden in them has therefore become an important topic in bioinformatics. In this thesis, a protein primary sequence was converted into a three-letter sequence by means of two physicochemical properties of amino acids. A simple graph without loops or multiple edges was then obtained, and the concepts of the geometric line adjacency matrix and the line adjacency index were proposed. By combining these elements with the corresponding order-correlated factors, a generalized PseAAC (pseudo amino acid composition) model was constructed to characterize a protein sequence. Using the proposed mathematical descriptor of a protein sequence, similarity comparisons were made among β-globin proteins of 17 species and among 72 spike proteins of coronaviruses. In the context of the novel coronavirus epidemic, the relationship among the three major coronavirus outbreaks of the 21st century was also analyzed. In addition, an SVM (support vector machine) model based on the generalized PseAAC was developed to identify DNA-binding proteins. Experimental results on the same datasets showed that the method performed significantly better than existing methods, including DNAbinder, DNA-Prot, iDNA-Prot, and enDNA-Prot. Compared with these four methods, it improved ACC by 3.29%-10.44%, MCC by 0.056-0.206, and F1M by 1.45%-15.76%. When the benchmark dataset was expanded with negative samples, the approach outperformed the four previous methods by 2.49%-19.12% in ACC, 0.05-0.32 in MCC, and 3.82%-33.85% in F1M. These results suggest that the proposed generalized PseAAC model is very efficient for the comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.
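For readers unfamiliar with the evaluation measures quoted above, the following is a minimal sketch of how ACC, MCC, and an F1 score are conventionally computed from the confusion matrix of a binary classifier. It assumes that F1M denotes the standard F1-measure and that the standard definitions apply; the abstract does not spell these out. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the thesis.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (ACC, MCC, F1) from confusion-matrix counts.

    tp/fn: DNA-binding proteins predicted correctly / missed.
    tn/fp: non-binding proteins predicted correctly / wrongly flagged.
    Standard textbook definitions; assumed, not confirmed by the thesis.
    """
    total = tp + fp + tn + fn
    acc = (tp + tn) / total

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # F1 is the harmonic mean of precision and recall.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    return acc, mcc, f1


# Hypothetical counts, for illustration only.
acc, mcc, f1 = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"ACC={acc:.2%}  MCC={mcc:.3f}  F1={f1:.2%}")
```

Note that ACC and F1 are usually reported as percentages, while MCC ranges from -1 to 1, which is consistent with the units of the improvements given in the abstract.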
Keywords/Search Tags:Pseudo amino acid composition, Identification of DNA-binding protein, Geometric line adjacency matrix, Phylogenetic analysis, Graphical representation