Application Of Model-Based Clustering In Protein Classification

Posted on: 2020-05-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y S Cao
Full Text: PDF
GTID: 2370330572980288
Subject: statistics
Abstract/Summary:
Cluster analysis is an important part of data mining. It plays a role in many research fields and has received growing attention in recent years, so it is necessary to understand its principles and to apply it properly in data analysis. Model-based clustering is an important clustering method that has been widely used in applications such as text clustering, handwriting recognition and image segmentation. Bioinformatics has also developed rapidly in recent years, and the study of protein sequence data has become a focus of attention within it. This thesis therefore applies model-based clustering to the classification of protein sequences, in order to address shortcomings of earlier work and to offer new ideas.

First, the thesis reviews the development of model-based clustering and the research results of domestic and foreign scholars. It then covers the theory of cluster analysis: the meaning of cluster analysis is briefly introduced; the classical clustering algorithms, advanced clustering algorithms and multi-source data algorithms in common use are summarized in detail; the mixture models, the expectation-maximization (EM) algorithm for parameter estimation and the model selection criteria involved in model-based clustering are explained with emphasis; the advantages and limitations of model-based clustering are briefly analyzed; and the current practical applications of cluster analysis are summarized.

Next, several methods are used to build different models and estimate their parameters in order to classify protein sequences. The theory is applied to a specific example, predicting the cellular localization sites of proteins. The data set contains 1484 yeast amino-acid sequences, 8 attribute variables and 10 specific localization sites. The problem is analyzed with the k-means method and with model-based clustering. The k-means method is illustrated with 5 clusters and with 8 clusters. The model-based clustering results are computed in R with the mclust, HDclassif (hddc) and Rmixmod packages.

Finally, the results of the different methods are discussed and evaluated in detail, combining the theoretical results with their practical meaning. First, model-based clustering achieves better results for protein sequence classification: the classification is clear, the differences between classes are significant, and the classes are more representative. The choice of the number of clusters has clear theoretical support. Each class is expressed in probabilistic form and its characteristics are described by the corresponding parameters, which turns the classification problem into a model optimization problem, makes it easier to apply statistical ideas and methods, and offers a new way of thinking when the nature of each class is studied in more depth. Second, model-based clustering has clear advantages over the k-means method: it addresses the fact that k-means cannot itself determine the number of clusters, and its classification results are clearer and more reasonable. Third, three different packages are used in order to select the most suitable model-based clustering algorithm. For this specific problem, the results of mclust and Rmixmod are more reasonable than those of HDclassif (hddc), and mclust is easier for beginners to operate and understand. Model-based clustering thus provides a new way of thinking and a direction of development for research on related problems, and it is expected to have good application prospects in bioinformatics.
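The role that EM estimation and a model selection criterion play in choosing the number of clusters can be illustrated with a minimal sketch. This is not the thesis's analysis, which used the R packages named above on the 8-attribute yeast data. It is a standard-library Python toy that assumes one-dimensional synthetic data, a univariate Gaussian mixture, and the common convention BIC = -2 log L + p log n, where smaller is better. The function names, the quantile-based initialization and the variance floor are all illustrative choices.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def mixture_density(x, weights, means, variances):
    """Density of the Gaussian mixture at x."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def fit_gmm_1d(data, k, n_iter=100, var_floor=1e-2):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative, not tuned)."""
    n = len(data)
    srt = sorted(data)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [srt[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(data) / n
    variances = [sum((x - centre) ** 2 for x in data) / n] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances from those posteriors.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, var_floor)  # guard against collapsing components
    loglik = sum(
        math.log(mixture_density(x, weights, means, variances) or 1e-300) for x in data
    )
    return weights, means, variances, loglik


def bic(loglik, k, n):
    """BIC = -2 log L + p log n, with p = 3k - 1 free parameters; smaller is better."""
    return -2.0 * loglik + (3 * k - 1) * math.log(n)


# Synthetic data (an assumption for illustration): two well-separated groups.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(150)] + [rng.gauss(10.0, 1.0) for _ in range(150)]

results = {}
for k in range(1, 5):
    weights, means, variances, loglik = fit_gmm_1d(data, k)
    results[k] = {"weights": weights, "means": means, "bic": bic(loglik, k, len(data))}

best_k = min(results, key=lambda k: results[k]["bic"])
print("number of components selected by BIC:", best_k)
```

Unlike k-means, where the number of clusters is fixed in advance, the candidate values of k here are compared on a common likelihood-based scale, and each fitted component is described by its own weight, mean and variance.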
Keywords/Search Tags: Cluster analysis, Model-based clustering, Proteins, Bioinformatics