| The aim of protein multiple sequence alignment (MSA) is to find similar fragments among amino acid sequences by comparing the sequences of three or more proteins. Previous work has shown that similar proteins may have similar functions, and understanding protein function helps biologists identify drug targets, annotate genes, understand pharmacology and pathology, and design proteins. The rapid development of high-throughput sequencing technology makes it more convenient and efficient for researchers to obtain protein sequence information, but it also yields more protein sequences with unknown functions. Determining protein function with traditional biochemical methods is time-consuming and impractical for large amounts of data. MSA makes it possible to infer unknown protein functions from sequence similarity, so the accuracy of MSA directly affects the efficiency and accuracy of protein function prediction. Proposing a reliable algorithm for protein MSA and sequence similarity analysis has therefore become a major focus of bioinformatics. The primary structure of a protein is a sequence of 20 different types of amino acids, and the rich physical and chemical properties of these amino acids determine which functions proteins perform in living organisms and how they perform them. Accordingly, this paper focuses on protein MSA algorithms based on amino acid property encoding. The main work of this paper is as follows: (1) A multiple sequence alignment algorithm based on single amino acid property encoding. Four groups of protein sequences of different sizes were encoded separately with four important amino acid properties, and the resulting numerical sequences were taken as the initial sequence signals. The Fast Fourier Transform was used to enrich the information in the numerical sequences; a sliding window was combined with the Higuchi fractal dimension to calculate the fractal dimension of the sequence in each window; the cosine distance function was chosen to calculate the distance between sequences, that is, their degree of similarity; and finally a phylogenetic tree was built from the distance matrix. One group of ND6 and three groups of globin protein sequences of different lengths were selected for this part, and the results were compared with previous work. The results of the four experiments show that the proposed algorithm is more accurate and effective. (2) A multiple sequence alignment algorithm based on multiple amino acid properties. Protein sequence data sets with large differences in length were processed as follows. The shorter ND6 protein sequences were encoded simultaneously with nine amino acid properties. For the longer ND5 protein sequences, MSA was also based on nine amino acid properties, but the properties were not encoded directly: Principal Component Analysis was used to extract the principal component of the nine properties, and this component was used to encode the ND5 protein sequences, reducing the dimension of the processed data. The Discrete Wavelet Transform was then used to decompose the time-domain and frequency-domain information of the sequences, and the Higuchi fractal dimension was combined with a Spearman distance matrix to build the phylogenetic tree. The data sets used in this paper are common baseline data sets used by existing MSA algorithms. The experimental results show that the proposed multiple sequence alignment and similarity analysis algorithm is reliable for protein sequences of different types and lengths. |
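
The pipeline of part (1) can be illustrated with a minimal Python sketch. It is not the implementation used in this work, and several choices are assumptions made only for illustration: the Kyte-Doolittle hydropathy scale stands in for one of the four (unnamed) amino acid properties, the FFT is zero-padded to a fixed length `n_fft` so that sequences of different lengths yield feature vectors of equal size, and the window size, step and `k_max` values are arbitrary. The function names are hypothetical, and the example sequences are toy strings, not ND6 or globin data.

```python
import numpy as np

# Assumption: Kyte-Doolittle hydropathy as the single encoding property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def higuchi_fd(x, k_max=5):
    """Higuchi fractal dimension of a 1-D series (needs len(x) > k_max)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    log_inv_k, log_len = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(k):
            idx = np.arange(m, n, k)          # sub-series x[m], x[m+k], ...
            n_steps = idx.size - 1
            if n_steps < 1:
                continue
            dist = np.abs(np.diff(x[idx])).sum()
            # Higuchi normalisation of the curve length for offset m, lag k
            lengths.append(dist * (n - 1) / (n_steps * k) / k)
        log_inv_k.append(np.log(1.0 / k))
        # tiny epsilon avoids log(0) for a constant window
        log_len.append(np.log(np.mean(lengths) + 1e-12))
    slope, _ = np.polyfit(log_inv_k, log_len, 1)
    return float(slope)


def fd_profile(seq, scale=HYDROPATHY, n_fft=256, window=32, step=16, k_max=5):
    """Encode -> FFT magnitude -> Higuchi FD in each sliding window."""
    # Non-standard residue letters (e.g. X) are not handled in this sketch.
    signal = np.array([scale[a] for a in seq], dtype=float)
    # n=n_fft zero-pads short sequences and truncates longer ones (assumption).
    spectrum = np.abs(np.fft.fft(signal, n=n_fft))
    starts = range(0, n_fft - window + 1, step)
    return np.array([higuchi_fd(spectrum[s:s + window], k_max) for s in starts])


def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def distance_matrix(seqs, **kwargs):
    """Pairwise cosine distances between the FD profiles of the sequences."""
    feats = [fd_profile(s, **kwargs) for s in seqs]
    n = len(feats)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = cosine_distance(feats[i], feats[j])
    return d


# Toy usage with made-up sequences
toy = ["MKTAYIAKQRQISFVKSHFSRQ", "MKTAYIAKQRQISFVKSHFSRQ", "GAVLIPFMWGAVLIPFMW"]
print(distance_matrix(toy))
```

The resulting symmetric matrix could then be passed to any distance-based tree-building method, for example average-linkage (UPGMA) clustering; the tree construction method is not specified here, so that step is left out of the sketch.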