Research On Clustering Algorithm Based On Subspace

Posted on:2018-09-18

Degree:Master

Type:Thesis

Country:China

Candidate:J Luo

Full Text:PDF

GTID:2348330518486559

Subject:Software engineering

Abstract/Summary:

With the rapid development of fields such as life science, mobile communication, e-commerce, and social networks, large volumes of high-dimensional data have emerged. Clustering such data effectively has become both a research hot spot and a hard problem. Traditional cluster analysis usually takes all attributes of the data into account, but high-dimensional data often contain many irrelevant or redundant attributes. These attributes make the sample points appear close to one another, so that clusters cannot be found in the full feature space. Subspace clustering instead looks for clusters in different subspaces of the same data set, which addresses this problem effectively. According to how the dimensions are weighted, existing algorithms fall into two categories: hard subspace clustering and soft subspace clustering. This thesis studies subspace clustering from both perspectives. The main work is as follows:

(1) The hard subspace clustering algorithm SUBCLU searches for clusters in the maximal interesting subspaces using a bottom-up search strategy. During its iterations it produces a large number of intermediate clusters, which consumes considerable time. To address this problem, the thesis proposes an improved algorithm called BDFS-SUBCLU. BDFS-SUBCLU uses depth-first search with backtracking to mine the clusters in the maximal interesting subspaces. This strategy avoids generating intermediate clusters and reduces the time complexity of the algorithm. The algorithm also adds a constraint on core points in the subspace, which prevents adjacent clusters from being merged into one because of a few special data points during clustering. Experiments on simulated and real data sets show that BDFS-SUBCLU improves on SUBCLU in both efficiency and accuracy.

(2) Most soft subspace clustering algorithms built on the k-means framework are sensitive to the initial cluster centers; with poorly chosen initial centers they become trapped in a local optimum prematurely. To address this problem, the thesis extends the original algorithms with a feedback step that checks whether the algorithm has become trapped in a local optimum. If it has, the current result is kept as the provisional optimum, and the feedback test is repeated until no better result can be found. The thesis also uses a contrast group to increase the chance of escaping a local optimum. Experiments on real UCI data sets show that the improved FSC and EWKM algorithms achieve higher accuracy.

(3) The open-source Chinese word segmenter mmseg4j is used for Chinese word segmentation, and the Vector Space Model is used to represent the texts as a numeric matrix. Finally, the soft subspace clustering algorithm is applied to text clustering.
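As a rough illustration of the soft subspace idea in item (2), the Python sketch below implements a small entropy-weighted k-means in the style of EWKM: each cluster keeps its own weight per dimension, and dimensions with low within-cluster dispersion receive higher weight. It is a sketch under stated assumptions, not the thesis's implementation. The function names, the parameter `gamma`, and the default values are hypothetical. In particular, keeping the best of several random restarts is only a simple stand-in for the feedback and contrast-group mechanism described above, whose details the abstract does not give.

```python
import math
import random


def entropy_weights(dispersions, gamma=1.0):
    """Dimension weights for one cluster: softmax of -dispersion / gamma."""
    d_min = min(dispersions)  # shift by the minimum for numerical stability
    exps = [math.exp(-(d - d_min) / gamma) for d in dispersions]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sq_dist(x, center, weights):
    """Squared Euclidean distance with one weight per dimension."""
    return sum(w * (a - c) ** 2 for a, c, w in zip(x, center, weights))


def soft_subspace_kmeans(points, k, gamma=1.0, iters=20, seed=0):
    """One run from a random initialization (assumed setup, not from the thesis)."""
    rng = random.Random(seed)
    dims = len(points[0])
    centers = [list(p) for p in rng.sample(points, k)]
    weights = [[1.0 / dims] * dims for _ in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center under that cluster's own weights.
        labels = [
            min(range(k), key=lambda l: weighted_sq_dist(p, centers[l], weights[l]))
            for p in points
        ]
        for l in range(k):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if not members:
                continue  # empty cluster: keep its previous center and weights
            # Center update: per-dimension mean of the members.
            centers[l] = [sum(col) / len(members) for col in zip(*members)]
            # Weight update: low-dispersion dimensions get high weight.
            disp = [
                sum((p[j] - centers[l][j]) ** 2 for p in members)
                for j in range(dims)
            ]
            weights[l] = entropy_weights(disp, gamma)
    # Objective: weighted dispersion plus the (negative) entropy penalty.
    cost = sum(
        weighted_sq_dist(p, centers[lab], weights[lab])
        for p, lab in zip(points, labels)
    )
    cost += gamma * sum(
        w * math.log(w) for ws in weights for w in ws if w > 0
    )
    return labels, centers, weights, cost


def best_of_restarts(points, k, restarts=5, **kwargs):
    """Stand-in for the feedback idea: keep the run with the lowest objective."""
    runs = [soft_subspace_kmeans(points, k, seed=s, **kwargs) for s in range(restarts)]
    return min(runs, key=lambda run: run[3])
```

Calling `best_of_restarts(points, k=2)` on a list of equal-length numeric tuples returns the labels, centers, per-cluster dimension weights, and objective value of the best run. A smaller `gamma` concentrates each cluster's weight on fewer dimensions, while a larger `gamma` pushes the weights toward uniform.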

Keywords/Search Tags:

high-dimensional data, clustering analysis, subspace, SUBCLU, FSC, EWKM, mmseg4j

Related items

1	Research On Subspace Clustering Algorithms For High-dimensional Data
2	Study On High-dimensional Data Subspace Clustering Analysis And Application
3	Research On Subspace Clustering Algorithm For High Dimensional Data
4	Research On Improved Subspace Clustering Algorithm
5	Research On Clustering Algorithms For High-Dimensional Data
6	The Research On Common Subspace Recognition Method For High Dimensional Data
7	Research On Key Technologies Of Clustering High-dimensional Data Based On Sparse Subspace And Their Applications
8	Research On Subspace Clustering Algorithms Based On Density
9	Improvement Research Of Clustering Algorithm Based On High-dimensional Data
10	Research On Clustering Algorithms For High-Dimensional Data