| Cluster analysis is an unsupervised machine learning technique and an important means of extracting information and knowledge from unlabeled datasets. It has been widely used in customer recommendation, pattern segmentation, video image processing and many other fields. As a partition-based clustering algorithm, K-means is widely used in cluster analysis because of its broad applicability and strong scalability. However, the random selection of the initial cluster centers leads to poor accuracy and unstable results: running K-means several times on the same dataset may produce markedly different clusterings. The clustering validity index (CVI) is the most commonly used method for evaluating the results generated by clustering algorithms. A CVI assesses the quality of a clustering based on information such as the compactness within each cluster and the separation between different clusters. Many clustering validity indexes have been proposed, but most of them have shortcomings such as poor stability and an inability to effectively reflect the quality of clustering results on real datasets. To address these problems, we first improve the traditional K-means algorithm and then propose a new clustering validity index, the CSI index. The main work of this thesis is as follows: (1) To address the problem that random selection of the initial cluster centers in the traditional K-means algorithm leads to unstable results and convergence to local optima, we propose an improved K-means algorithm, DT-Kmeans, that optimizes the selection of the cluster centers. The algorithm determines the neighborhood parameter Eps from the Euclidean distance between each data point in the dataset and its t-th nearest neighbor, and then computes the density of each data point based on Eps. In the initial center selection phase, DT-Kmeans selects the first cluster center at random and selects the remaining centers based on the density of the data points and their distances to the centers already chosen. (2) We propose a new clustering validity index, the CSI index. It evaluates the clustering results of a dataset according to the distance within clusters and the separation between clusters. By weighting these two quantities and combining them linearly to balance their relationship, the index can stably evaluate the clustering results of many datasets. (3) Multiple simulated and real datasets were used to test the proposed DT-Kmeans algorithm and CSI index experimentally. The results show that DT-Kmeans achieves higher clustering quality than the traditional K-means, K-medoids and K-means++ algorithms, and that the stability of its clustering results is significantly improved compared with these algorithms. Compared with five existing clustering validity indexes (the COP, CSP, DBI, DI and I indexes), the CSI index evaluates the clustering quality of a dataset more accurately and has a wider range of application. |
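The center selection described in item (1) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so two choices below are assumptions made only for illustration: Eps is taken as the mean of the t-th nearest-neighbor distances, and each remaining center is the point that maximizes density multiplied by the distance to the nearest center already chosen. The function name `dt_seed_centers` and its parameters are hypothetical and are not taken from the thesis.

```python
import numpy as np


def dt_seed_centers(X, k, t=4, rng=None):
    """Pick k initial centers from X using density and distance (illustrative)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Pairwise Euclidean distances, shape (n, n).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Distance from each point to its t-th nearest neighbor
    # (column 0 of each sorted row is the point itself, at distance 0).
    t_dist = np.sort(dists, axis=1)[:, t]
    # Assumption: Eps is the mean of the t-th nearest-neighbor distances.
    eps = t_dist.mean()
    # Density of a point = number of other points within Eps of it.
    density = (dists <= eps).sum(axis=1) - 1
    # The first center is selected at random, as stated in the abstract.
    centers = [int(rng.integers(n))]
    while len(centers) < k:
        # Distance from every point to its closest already-chosen center.
        d_min = dists[:, centers].min(axis=1)
        # Assumption: prefer points that are both dense and far from chosen centers.
        score = density * d_min
        centers.append(int(np.argmax(score)))
    return X[centers], eps
```

The returned centers could then be passed to any standard K-means implementation as its starting points. Multiplying density by distance is only one plausible way to combine the two criteria; the thesis may use a different rule.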