Categorical data clustering methods include partition-based, hierarchy-based, density-based, grid-based, and model-based clustering. Among the partition-based methods, the k-modes algorithm, which uses the simple Hamming distance to measure the dissimilarity between objects and a frequency-based method to recalculate and update cluster centers, is the most classic and widely used. However, first, this dissimilarity coefficient and objective function pay insufficient attention to the characteristics of categorical data: they weaken intra-cluster similarity and ignore inter-cluster similarity. Second, the random initialization or manual setting of cluster centers adopted by this algorithm introduces considerable uncertainty into the clustering results. In practical applications, it is common for data sets to contain both numerical and categorical data. In the classic k-prototypes algorithm, the balance between the categorical and numerical attributes is adjusted by the manually set parameter γ, which has a great influence on the clustering result. Through an analysis of the classic k-modes and k-prototypes algorithms, this thesis summarizes data structure types, data standardization, data types, dissimilarity computation, and the classic k-modes, k-means, and k-prototypes algorithms, and finally improves the k-prototypes algorithm. The details are as follows:

(1) To improve the accuracy of the k-modes algorithm and solve the problem of initial cluster center selection, this thesis proposes a k-modes clustering algorithm based on an intra-cluster and inter-cluster dissimilarity coefficient (IKMCA). IKMCA improves the dissimilarity coefficient according to intra-cluster and inter-cluster similarity and provides a concrete method for the automatic selection of initial cluster centers. The intra-cluster and inter-cluster dissimilarity coefficient not only takes the dissimilarity of the attribute values themselves into consideration, but also accounts for their differentiation from other related attributes. The automatic selection of initial cluster centers determines both the number of clusters and the locations of the initial centers. Experiments with IKMCA on real UCI data sets show that IKMCA is superior to the classic k-modes algorithm and its variants in clustering accuracy, purity, and recall.

(2) To avoid feature transformation and parameter adjustment between different data types, and to address the feature-weighting problem in clustering high-dimensional data, a categorical dissimilarity coefficient based on entropy weight, a quantized numerical dissimilarity coefficient, and a mixed dissimilarity coefficient for clustering mixed data are developed. This mixed dissimilarity coefficient fully considers the importance of the categorical and numerical attribute values, applies a unified criterion, and computes the dissimilarity between data objects and clusters more objectively. In addition, a weighted k-prototypes clustering algorithm based on the mixed dissimilarity coefficient (WKPCAD) is proposed by applying the weighted mixed dissimilarity coefficient to the classic k-prototypes algorithm. Experiments on real UCI data sets verify the effectiveness and robustness of WKPCAD.
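The classic k-modes building blocks criticized above can be sketched briefly. The following minimal Python sketch shows the simple Hamming (matching) dissimilarity and the frequency-based mode update; the function names and toy data are illustrative, not taken from the thesis.

```python
from collections import Counter

def hamming_dissimilarity(x, mode):
    """Simple matching dissimilarity: number of attributes on which x differs from the mode."""
    return sum(1 for a, b in zip(x, mode) if a != b)

def update_mode(cluster):
    """Frequency-based update: the new mode takes the most frequent value of each attribute."""
    n_attrs = len(cluster[0])
    return tuple(
        Counter(obj[j] for obj in cluster).most_common(1)[0][0]
        for j in range(n_attrs)
    )

# Toy example: three categorical objects with two attributes each.
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = update_mode(cluster)                          # ("red", "small")
d = hamming_dissimilarity(("blue", "large"), mode)   # differs on both attributes -> 2
```

Because every attribute contributes either 0 or 1 regardless of how its values are distributed, this coefficient treats all mismatches identically, which is exactly the weakness the intra-cluster/inter-cluster coefficient of IKMCA is designed to address.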
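The role of the manually set parameter γ in the classic k-prototypes algorithm can be illustrated with the standard mixed dissimilarity (squared Euclidean distance on the numerical attributes plus γ times the matching dissimilarity on the categorical attributes). This is a sketch of the classic formulation, not of WKPCAD; the function name and toy values are illustrative.

```python
def kprototypes_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """Classic k-prototypes dissimilarity: squared Euclidean distance on the
    numerical part plus gamma times the simple matching dissimilarity on the
    categorical part. gamma must be chosen manually and shifts the balance
    between the two data types."""
    num_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    cat_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return num_part + gamma * cat_part

# Same object and prototype, two different gamma settings:
d_low = kprototypes_dissimilarity([1.0, 2.0], ["red"], [1.0, 0.0], ["blue"], gamma=0.5)
d_high = kprototypes_dissimilarity([1.0, 2.0], ["red"], [1.0, 0.0], ["blue"], gamma=5.0)
# numeric part = 4.0, categorical part = 1 -> 4.5 vs. 9.0
```

The same pair of objects can be judged close or distant depending solely on γ, which is why the manual setting of this parameter has such a strong effect on the clustering result and why WKPCAD replaces it with a unified weighted coefficient.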
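The abstract does not give the exact entropy-weight formula used by WKPCAD, so the following sketch shows only one common way such weights are constructed: each categorical attribute is weighted by its normalized Shannon entropy, so that attributes whose value distributions carry more information contribute more to the dissimilarity. The function and the direction of the weighting are assumptions for illustration, not the thesis's definition.

```python
import math
from collections import Counter

def entropy_weights(data):
    """Illustrative entropy weights for categorical attributes: compute the
    Shannon entropy of each attribute's value distribution, then normalize
    the entropies so the weights sum to 1. (The weighting actually used by
    WKPCAD is defined in the thesis; this is only a common construction.)"""
    n = len(data)
    n_attrs = len(data[0])
    entropies = []
    for j in range(n_attrs):
        counts = Counter(row[j] for row in data)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        entropies.append(h)
    total = sum(entropies)
    if total == 0:                      # every attribute is constant
        return [1.0 / n_attrs] * n_attrs
    return [h / total for h in entropies]

data = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
w = entropy_weights(data)   # both attributes uniform over two values -> equal weights
```

Plugging such per-attribute weights into the categorical part of the mixed dissimilarity is what lets mismatches on informative attributes count for more than mismatches on nearly constant ones.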