
Mathematical Methods Of Similarity Comparison And Clustering Of Protein Sequences

Posted on: 2019-12-19
Degree: Master
Type: Thesis
Country: China
Candidate: H Wang
Full Text: PDF
GTID: 2370330572959961
Subject: Mathematics
Abstract/Summary:
Computational molecular biology is an interdisciplinary field that draws on and integrates several disciplines. It is mainly concerned with processing complex biological experimental data, and it serves fields such as gene diagnosis, drug development, and disease treatment. The study of protein sequence similarity has important theoretical and practical value for predicting the function of unknown proteins, classifying proteins, and determining homologous evolutionary relationships among organisms. This thesis explores simple, rapid, and effective mathematical methods for analyzing protein sequences, providing a reference for future comparative analysis of biological sequences. The work focuses on comparing the similarity of protein sequences and constructing clustering maps using mathematical methods based on dimensionality reduction. The main results can be summarized as follows:

1. Protein sequences are characterized by the physicochemical properties of the different kinds of amino acids and converted into 11-dimensional and 16-dimensional feature vectors. A factor analysis model is used to reduce the dimension of these feature vectors, and the resulting factor model is used to analyze the similarity of 40 G protein-coupled receptor sequences under different physicochemical properties and to cluster them.

2. Based on four classes of amino acid physicochemical properties (polar and hydrophilic, pq; polar and hydrophobic, pr; non-polar and hydrophilic, sq; non-polar and hydrophobic, sr), their pairwise (two-by-two) connections, and the 20 kinds of amino acids, the Fourier transform is used to convert the character sequence of a protein into a numerical sequence. The corresponding feature vectors are obtained from the Fourier power spectrum. The similarity of 31 protein sequences containing hemagglutinin proteins is analyzed using the median distance between the feature vectors, and a clustering map is constructed.

3. Based on the 20 amino acids that make up a protein sequence and their physicochemical properties, a 40-dimensional feature vector is decomposed into 20-, 16-, and 4-dimensional feature vectors to analyze the correlation of the protein sequences under the different feature vectors. A low-dimensional, effective feature vector is then selected to hierarchically cluster 28 influenza virus protein sequences containing hemagglutinin (HA) and neuraminidase (NA).
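
The Fourier power spectrum step in the second result can be illustrated with a minimal sketch. The abstract does not give the thesis's actual numerical encoding, feature definition, or distance, so everything below is an assumption chosen for illustration: each sequence is encoded as 20 binary indicator sequences (one per amino acid), each indicator's power spectrum is summarized by two length-normalized moments, and sequences are compared by plain Euclidean distance rather than the "median distance" the abstract mentions. The function names and the toy sequences are hypothetical, and the resulting vector dimensions are not those used in the thesis.

```python
import cmath
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def power_spectrum(signal):
    """Return |DFT(signal)[k]|**2 for k = 0..n-1 (naive O(n^2) transform)."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def feature_vector(seq):
    """Two spectral summaries per residue type (assumed, not the thesis's features).

    Assumes a non-empty sequence over the 20 standard residues.
    """
    n = len(seq)
    features = []
    for aa in AMINO_ACIDS:
        # Binary indicator sequence: 1 where this residue occurs, else 0.
        indicator = [1.0 if ch == aa else 0.0 for ch in seq]
        ps = power_spectrum(indicator)[1:]  # drop k = 0 (the constant term)
        # First moment: depends only on how often the residue occurs.
        features.append(sum(ps) / n ** 2)
        # Second moment: also sensitive to where the residue occurs.
        features.append(sum(p * p for p in ps) / n ** 4)
    return features


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(seqs):
    """Pairwise distances between the feature vectors of the given sequences."""
    vectors = [feature_vector(s) for s in seqs]
    return [[euclidean(u, v) for v in vectors] for u in vectors]


# Made-up toy sequences, for illustration only (not data from the thesis).
toy = ["MKTLLVAGAV", "MKTLLVAGAI", "GGSGGSPPWW"]
for row in distance_matrix(toy):
    print(["%.4f" % d for d in row])
```

Because every sequence maps to a vector of the same length regardless of its own length, sequences of different lengths can be compared directly. A distance matrix of this kind could then be passed to a hierarchical clustering routine to draw a clustering map; the naive transform here is only practical for short sequences, and a fast Fourier transform would normally replace it.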
Keywords/Search Tags: Protein Sequence, Factor Model, Fourier Power Spectrum, Correlation Analysis, Cluster Analysis