
Similarity Analysis And Application Of Protein Sequences Based On Positional Sequences

Posted on: 2019-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: L Wang
GTID: 2370330569477692
Subject: Applied Mathematics
Abstract/Summary:
With the emergence of bioinformatics and the progress of human genome research, more and more biological sequence data are being used in scientific research. These data contain a large amount of biological information, and with the rapid development of science and technology, a growing number of protein sequences need to be analyzed. Proteins are the material basis of life activities. By understanding proteins, human beings can better understand the essence of life and promote human health. Proteins are also carriers of genetic information, so studying them has great biological significance. The sequence of a protein determines its structure, which in turn determines its function. The analysis of protein sequences is therefore the basis for analyzing protein structure and function, and also for studying unknown sequences on the basis of known ones. This research addresses the similarity analysis of protein sequences based on amino acid position information. The main contributions are as follows:

(1) We define two kinds of k-word average distances and use them to construct a numerical vector representation of protein sequences. On this basis, we propose two similarity analysis methods for protein sequences based on k-word position sequences, called the normalized k-word average relative distance method and the new normalized k-word average relative distance method. The Euclidean or Manhattan distance between vectors is used as the relative distance between species, and phylogenetic trees are constructed for the ND5 protein sequences of 9 species and the ND6 protein sequences of 8 species using systematic (hierarchical) clustering. Cross-validation shows that the new normalized k-word average relative distance method performs better in both accuracy and standard deviation than the normalized k-word average relative distance method.

(2) By combining nine normalized physicochemical properties of the amino acids with the frequency and average position of each amino acid, a 49-dimensional numerical vector representation of a protein sequence is constructed. The Euclidean distance between vectors is used as the similarity distance between species, from which a phylogenetic tree of the sequences is obtained. Phylogenetic trees of the ND5 protein sequences of nine species and the ND6 protein sequences of eight species are constructed with this method. The alignment-free method proposed in this thesis is evaluated against the similarity distances produced by Clustal W, a mature alignment-based method.

The results show that, compared with existing methods, the method based on standardized physicochemical properties and the numerical representation method based on k-word position sequences both use relatively low-dimensional vectors and yield better and more stable classification results. The proposed method was also applied to protein sequence datasets of influenza viruses of 28 species, and the results indicate that it can be applied widely and effectively.
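The general idea of an alignment-free, position-based representation can be illustrated with a minimal Python sketch. It is not the thesis's method: the abstract does not give the exact definitions of the k-word average distances, so the sketch assumes the simplest case of single residues (k = 1) and builds a 40-dimensional vector from the frequency and the length-normalized mean position of each of the 20 standard amino acids. The physicochemical properties are omitted because their values are not stated in the abstract. The function names `position_vector`, `euclidean`, `manhattan`, and `distance_matrix` are hypothetical.

```python
import math

# The 20 standard amino acids, in a fixed order (an assumption of this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_vector(seq):
    """Return 20 residue frequencies followed by 20 normalized mean positions."""
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    freqs, mean_positions = [], []
    for aa in AMINO_ACIDS:
        # 1-based positions at which this residue occurs.
        positions = [i + 1 for i, ch in enumerate(seq) if ch == aa]
        freqs.append(len(positions) / n)
        # Mean position divided by sequence length; 0.0 if the residue is absent.
        if positions:
            mean_positions.append(sum(positions) / len(positions) / n)
        else:
            mean_positions.append(0.0)
    return freqs + mean_positions


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan (city-block) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_matrix(seqs, dist=euclidean):
    """Pairwise distances between the position vectors of the given sequences."""
    vecs = [position_vector(s) for s in seqs]
    return [[dist(a, b) for b in vecs] for a in vecs]
```

A pairwise matrix produced by `distance_matrix` is the kind of input a hierarchical clustering routine would take to build a tree. Passing `dist=manhattan` switches the metric, mirroring the choice between Euclidean and Manhattan distance mentioned in the abstract.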
Keywords/Search Tags:protein sequence, similarity analysis, physicochemical properties, system cluster analysis