Penalized Gaussian Mixture Model-Based High-dimensional Data Clustering

Posted on: 2017-01-17  Degree: Master  Type: Thesis
Country: China  Candidate: G J Zhu  Full Text: PDF
GTID: 2180330503461410  Subject: Mathematics and probability theory and mathematical statistics
Abstract/Summary:
This thesis is devoted to clustering "high dimension, low sample size" data. The data are assumed to be drawn from a Gaussian mixture model (GMM) in which each component corresponds to a cluster. Variable selection is carried out within the clustering procedure: the variables that carry important information (informative variables) are identified, and the data are then clustered on the basis of these variables. Both clustering and variable selection are studied through a Gaussian mixture model with a penalty function. Three penalty functions on the global mean are investigated: the L1 penalty, the adaptive L1 penalty, and the adaptive hierarchical penalty, which give rise to three models, L1-GMM, Adaptive-L1-GMM, and Adaptive-H-GMM. The gap statistic is used to estimate the number of clusters, and the EM algorithm is used to estimate the parameters $\pi_k^{(s)}$, $\mu_{kp}^{(s)}$, and $\sigma_p^{(s)}$. Whether a variable is informative is determined from $\mu_{kp}$, and the tuning parameter $\lambda$ is chosen by a modified BIC. The three models are applied to simulated data and to real gene expression data. All three perform well on the simulated data, in the sense that the clustering and variable selection results are consistent with the original data. On the gene expression data the three models perform differently, and Adaptive-H-GMM performs best: it selects 14 informative variables out of 300, which reduces both the computational cost and the model complexity, and its clustering error rate is 4/72, which is acceptable.
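To illustrate how an L1 penalty on the component means can select variables inside the EM iterations, the following is a minimal sketch only, not the thesis's implementation. It assumes a common diagonal covariance across components, data standardized so that each variable has global mean zero, and the usual soft-thresholding form of the penalized mean update; the function names `soft_threshold` and `l1_gmm_em`, the random initialization, and the fixed iteration count are all hypothetical choices made for this example.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; entries with |z| <= t become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_gmm_em(X, K, lam, n_iter=100, seed=0):
    """EM sketch for an L1-penalized GMM with a shared diagonal covariance.

    Assumes X (n samples x d variables) is standardized column-wise, so a
    variable whose mean is zero in every component carries no cluster signal.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=K, replace=False)].copy()  # initial means
    pi = np.full(K, 1.0 / K)                             # mixing proportions
    var = X.var(axis=0) + 1e-6                           # per-variable variance

    for _ in range(n_iter):
        # E-step: posterior probability tau[i, k] that sample i is in cluster k.
        diff = X[:, None, :] - mu[None, :, :]
        log_p = -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=2)
        log_p += np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)        # numerical stability
        tau = np.exp(log_p)
        tau /= tau.sum(axis=1, keepdims=True)

        # M-step: proportions, soft-thresholded means, shared variances.
        nk = tau.sum(axis=0) + 1e-12
        pi = nk / n
        mu_tilde = (tau.T @ X) / nk[:, None]             # unpenalized means
        mu = soft_threshold(mu_tilde, lam * var[None, :] / nk[:, None])
        diff = X[:, None, :] - mu[None, :, :]
        var = np.einsum("nk,nkd->d", tau, diff**2) / n + 1e-6

    labels = tau.argmax(axis=1)
    informative = np.any(mu != 0.0, axis=0)  # nonzero mean in some cluster
    return pi, mu, var, labels, informative
```

In this sketch a variable is flagged as informative when at least one component mean for it remains nonzero after shrinkage, which mirrors the role of $\mu_{kp}$ described above. A larger `lam` shrinks more means to zero and therefore keeps fewer variables; in practice it would be chosen over a grid by a model selection criterion such as the modified BIC.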
Keywords/Search Tags:Gap Statistics, BIC, variable selection, Adaptive-H-GMM