| The development of molecular biology has established the central position of biological macromolecules such as DNA and proteins in the study of microscopic life phenomena, so the analysis of biological macromolecule sequences has become the primary task in studying the functions of biological molecules. There are two common approaches to the analysis of biomolecular sequences: the multiple sequence alignment (MSA) method and the alignment-free (AF) method. MSA methods are not suitable for large-scale genome analysis because of their high algorithmic complexity. Therefore, especially in the last two decades, with the advent of the era of big data in genomics, alignment-free methods have developed rapidly and have been widely used in the calculation of sequence similarity, phylogenetic analysis, and related tasks. Among them, the Natural Vector (NV) method proposed by Yau introduces statistical ideas into sequence analysis, using single nucleotides as the object of feature extraction and describing their distribution. It has been successfully applied to the analysis of viral and bacterial genome sequences. However, the natural vector method can only process one-dimensional, string-like data, and it only extracts the distribution of single nucleotides, so it needs to be improved and extended. This article has two main parts. The first is the extension of the natural vector method, in which three new methods for analyzing biological macromolecule sequences are proposed. First, the Chaos Game Representation (CGR) is combined with the natural vector method: by taking the CGR image as the object of feature extraction, the Extended Natural Vector (ENV) is proposed to analyze the gray-scale image. Second, the Fourier Transform (FT) is introduced into the natural vector method, and a central moment and covariance vector based on the power spectrum and phase spectrum of the Fourier transform is proposed, which can be used to distinguish gene structures. Finally, this article takes the k-mer as the
object of feature extraction and proposes the k-mer natural vector method. On this basis, a new genome metric, named the natural metric, is proposed to accurately measure the differences between genomes. The second part applies the above methods to the large-scale genome analysis of SARS-CoV-2. We study the possible intermediate hosts of SARS-CoV-2 by using these methods to compare SARS-CoV-2 with coronaviruses from different animal hosts. The natural metrics between all complete SARS-CoV-2 genomes from 2020 and bat coronaviruses are calculated to analyze the early transmission and origin of the virus. The coding sequences of the spike (S) proteins and the complete genome sequences of different types of SARS-CoV-2 variants are analyzed by the convex hull method. The convex hulls formed by the natural vector sets of the complete genome sequences of different types of SARS-CoV-2 variants are disjoint, which further supports the convex hull principle. |
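
To make concrete how the natural vector describes the distribution of single nucleotides, here is a minimal sketch. It assumes the commonly cited 12-dimensional form (for each nucleotide: its count, its mean position, and its second central moment of position normalized by the count times the sequence length). The function name `natural_vector`, the 1-based positions, and the handling of absent or non-ACGT characters are illustrative assumptions and are not taken from this abstract.

```python
def natural_vector(seq, alphabet="ACGT"):
    """Sketch of a 12-dimensional natural vector for a DNA string.

    Assumed layout: [n_A, n_C, n_G, n_T, mu_A, ..., mu_T, D2_A, ..., D2_T].
    """
    seq = seq.upper()
    n = len(seq)  # assumption: total length, including any non-ACGT symbols

    # Collect the 1-based positions at which each nucleotide occurs.
    positions = {c: [] for c in alphabet}
    for i, ch in enumerate(seq, start=1):
        if ch in positions:
            positions[ch].append(i)

    counts, means, moments = [], [], []
    for c in alphabet:
        pos = positions[c]
        n_k = len(pos)
        if n_k == 0:
            # Assumption: absent nucleotides contribute zeros.
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(pos) / n_k
        # Second central moment of position, normalized by n_k * n.
        d2_k = sum((p - mu_k) ** 2 for p in pos) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)

    return counts + means + moments


if __name__ == "__main__":
    print(natural_vector("ACGTA"))
```

Two sequences can then be compared by, for example, the Euclidean distance between their vectors. The k-mer variant described above would, under the same assumptions, track the positions of k-mers instead of single nucleotides.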