| Cluster analysis is an important research direction in data mining, with applications in fields such as economics, agriculture, medical treatment, and petroleum exploration. It comprises two parts: clustering algorithms and cluster validity indexes. Effectively evaluating clustering results remains a challenging task. Cluster validity indexes can be divided into internal and external indexes; this thesis studies external cluster validity indexes, which use category label information to evaluate clustering results. Many external cluster validity indexes have been proposed, and they fall into three families: pair-counting, information-theoretic, and set-matching indexes. However, these indexes share a problem: cluster size enters the calculation of the index, so clusters of different sizes influence the index value to different degrees. In response, this thesis proposes a class equality cluster validity index, which treats all classes as equal regardless of the number of samples they contain. The thesis compares and analyzes 6 cluster validity indexes on 4 artificial data sets and 19 real data sets, and verifies the validity and superiority of the proposed index. The K-means algorithm is simple to implement, easy to understand, and fast to run, and it is the best-known and most widely used clustering algorithm. However, K-means is sensitive to initialization: the initial cluster centers must be specified, and poor initialization leads to poor clustering results. Many initialization algorithms have been proposed, but they still need to be run multiple times to determine the optimal clustering result. To solve these problems, this thesis proposes a cluster filter K-means algorithm, which judges whether a cluster is valid by comparing the density at the cluster center with the density at the cluster edge, and re-clusters the invalid clusters. The thesis compares and analyzes 6 benchmark algorithms on 13 public artificial data sets and 19 real data sets, and verifies the effectiveness and superiority of the proposed algorithm. |
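
The cluster-size effect described in the abstract can be illustrated with a small sketch. It compares purity, a standard size-weighted set-matching index, with a score that averages over classes with equal weight. The function `class_averaged_jaccard` and the toy labels are assumptions made for illustration only; this is not the class equality index proposed in the thesis, whose definition is not given here.

```python
from collections import Counter


def purity(true_labels, pred_labels):
    """Size-weighted set-matching index: each cluster contributes its majority count."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, Counter())[t] += 1
    matched = sum(max(counts.values()) for counts in clusters.values())
    return matched / len(true_labels)


def class_averaged_jaccard(true_labels, pred_labels):
    """Hypothetical illustration: best Jaccard match per class, each class weighted equally."""
    classes, clusters = {}, {}
    for i, (t, p) in enumerate(zip(true_labels, pred_labels)):
        classes.setdefault(t, set()).add(i)
        clusters.setdefault(p, set()).add(i)
    scores = [
        max(len(members & c) / len(members | c) for c in clusters.values())
        for members in classes.values()
    ]
    return sum(scores) / len(scores)


# Toy data (assumed): a large class A and a small class B.
# Half of B is absorbed into the cluster that holds all of A.
true_labels = ["A"] * 90 + ["B"] * 10
pred_labels = [0] * 95 + [1] * 5

print("purity:", purity(true_labels, pred_labels))
print("class-averaged Jaccard:", class_averaged_jaccard(true_labels, pred_labels))
```

On this toy data, purity is 0.95 because the large class dominates the count, while the class-averaged score is about 0.72 because the badly split small class counts as much as the large one.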