
Model-Based High-Dimensional Data Clustering Methods: A Review

Posted on: 2020-07-05
Degree: Master
Type: Thesis
Country: China
Candidate: C Qin
Full Text: PDF
GTID: 2417330572976089
Subject: Statistics
Abstract/Summary:
With the development of computers, the Internet, big data and artificial intelligence, high-dimensional data have become increasingly common. Typical examples include portfolio analysis and credit-default analysis in finance, image recognition in computing, and gene expression data in biology. High-dimensional data arise from the attempt to capture more information. In portfolio analysis, for instance, there are often many choices and decision-making methods, each with its own returns and risks, so research institutions take more and more variables into account in order to characterize the risk-return model more accurately. High-dimensional data usually lie in different low-dimensional subspaces hidden in the original feature space, and clustering such data has become a hot research topic.

Conventional clustering methods take all data attributes into account, but various problems arise as the data dimension increases, such as the problem of sample size, as well as the zero gap, dimensional validity, correlation dimension and so on. These problems, which traditional clustering methods find difficult to handle, are collectively referred to as the "curse of dimensionality", and how to mitigate its impact effectively has been an active topic of academic research in recent years.

This paper analyzes the clustering of high-dimensional data from a technical point of view. First, model-based clustering algorithms are described comprehensively, and the curse-of-dimensionality problem is introduced. The most common approach to clustering in high-dimensional spaces is dimensionality reduction, so the paper next introduces four classical linear dimensionality reduction algorithms, including PCA and MDS, and four nonlinear dimensionality reduction algorithms, including KPCA and ISOMAP. The disadvantage of these traditional dimensionality
reduction algorithms is that they all reduce the original feature space globally to a single shared subspace, without taking the subsequent clustering task into account; this may discard informative features and thus destroy the original clustering structure. In recent years, subspace clustering has been proposed to overcome these limitations. Subspace clustering algorithms try to cluster on different subspaces of the data set and perform dimensionality reduction while clustering, which not only improves classification accuracy but also effectively alleviates the curse of dimensionality.

The paper introduces several subspace clustering algorithms for model-based clustering, including MFA, EPGMM, HD-GMM and DLM. The MFA model combines the Gaussian mixture model with factor analysis, achieving dimensionality reduction and clustering at the same time. The EPGMM model introduces a modified factor-analysis covariance structure on top of the mixture of factor analyzers model; by constraining certain aspects of this structure, a family of submodels is derived. The HD-GMM model is no longer based on factor analysis, but instead combines the subspace clustering idea with a parsimonious Gaussian mixture model to achieve both clustering and dimensionality reduction. The DLM model fits the data in a latent orthonormal discriminant subspace whose intrinsic dimension is smaller than that of the original space and which is shared by all clusters; by constraining the within-group and between-group model parameters, DLM likewise yields a family of submodels adapted to different situations.

Finally, experiments on real data sets show that subspace clustering algorithms are better suited than traditional methods to high-dimensional data with small sample sizes. Firstly, there is a linear relationship between the number of parameters in the covariance structure of the subspace
clustering algorithm and the original dimension. Secondly, subspace clustering algorithms can discover the different low-dimensional subspaces hidden in the original feature space and thereby improve classification accuracy. When all clusters share the same low-dimensional subspace, the dimensionality reduction strategy of the subspace clustering algorithm degenerates into a global dimensionality reduction algorithm.
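To make the contrast concrete, the following sketch shows the traditional pipeline the review critiques: reduce the whole data set to one shared low-dimensional subspace first (linearly with PCA, nonlinearly with Isomap), then cluster afterwards. The synthetic data, dimensions and cluster counts are illustrative assumptions, not from the thesis; the point is only that the reduction step here is global and ignores the clustering task.

```python
# Traditional "reduce globally, then cluster" pipeline (illustrative sketch).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.cluster import KMeans

# Synthetic high-dimensional data: 3 clusters in 50 dimensions (assumed setup).
X, y = make_blobs(n_samples=300, n_features=50, centers=3, random_state=0)

# Linear (PCA) and nonlinear (Isomap) reduction to the SAME global 2-D
# subspace for all points -- the clustering structure plays no role here.
X_pca = PCA(n_components=2).fit_transform(X)
X_iso = Isomap(n_components=2).fit_transform(X)

# An ordinary clustering step is applied only after the global reduction.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)
print(X_pca.shape, X_iso.shape, labels.shape)  # (300, 2) (300, 2) (300,)
```

If the true clusters lived in different subspaces, this shared projection could merge or distort them, which is exactly the failure mode subspace clustering methods such as MFA and HD-GMM are designed to avoid.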
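The "family of submodels" idea behind EPGMM and HD-GMM, where constraining the covariance structure trades flexibility for fewer parameters, can be sketched with scikit-learn's `GaussianMixture`, whose `covariance_type` options (`full`, `tied`, `diag`, `spherical`) form an analogous nested family. This is a simplified stand-in, not the parsimonious models from the thesis; the data and settings are assumptions for illustration.

```python
# Parsimonious Gaussian mixtures via constrained covariance structures
# (illustrative analogue of the EPGMM/HD-GMM submodel families).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, n_features=10, centers=3, random_state=1)

# From most to least flexible: 'full' > 'tied' > 'diag' > 'spherical'.
# Each constraint cuts the number of free covariance parameters; BIC
# balances goodness of fit against that parameter count.
bic = {}
for cov in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov,
                          random_state=1).fit(X)
    bic[cov] = gmm.bic(X)

best = min(bic, key=bic.get)  # submodel selected by lowest BIC
print({k: round(v, 1) for k, v in bic.items()}, "selected:", best)
```

Choosing among such constrained submodels by BIC is what keeps mixture models tractable as the dimension grows, since an unconstrained covariance matrix needs a number of parameters quadratic in the dimension while the constrained variants grow far more slowly.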
Keywords: Cluster Analysis, Gaussian Mixture Model, Curse of Dimensionality, Dimension Reduction, Subspace Clustering