Research on K-prototypes Clustering Algorithm and Data Dimension Reduction

Posted on: 2022-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: J C Gu
GTID: 2507306509469784
Subject: Statistics
Abstract/Summary:
With the development of the mobile Internet and information technology, massive data sets have become the norm in every industry and field, which places higher demands on the methods used to analyze enterprise data and big data. Using machine learning and other new techniques to explore the broad and varied relationships between enterprises and big data has therefore become a major challenge in big data analysis. In statistical machine learning, clustering is among the most common and effective methods for handling big data, and the classic clustering algorithms include K-means, K-modes and K-prototypes. K-prototypes is widely used, but because of its limitations its clustering results are inaccurate in many cases. To address the inaccuracy of K-prototypes on mixed data (data with both numerical and categorical attributes), an enhanced K-prototypes mixed data clustering algorithm (EKPCA) is proposed. First, a new distance formula is defined that enlarges the differences between data points, which helps assign data at cluster edges reasonably. Second, more initial prototypes are selected so that they cover the overall information in the data. Finally, redundant prototypes are eliminated iteratively to recover the true classification of the data set. The algorithm is evaluated on 8 UCI datasets, and the experimental results show that EKPCA achieves high clustering accuracy.

The era of big data involves not only a huge volume of data but also growth in data dimension. In the study of high-dimensional data, the biggest problem is this growth in dimension, often called the "curse of dimensionality" ("dimension disaster"). Research shows that as the dimension increases, both the complexity of processing high-dimensional data and the number of samples required grow exponentially. Moreover, many existing clustering methods are suited only to low-dimensional data; on high-dimensional data, both the efficiency and the quality of clustering decline. It is therefore imperative to reduce the dimension of the data. To address the shortcomings of the K-prototypes clustering algorithm on high-dimensional data, this thesis proposes the Subspace Projection K-prototypes Clustering Algorithm (SPKCA), which introduces the idea of subspace projection. For high-dimensional mixed data, a subspace is first found, the dissimilarity between data points is projected onto that subspace, and the projected data are then clustered. SPKCA effectively reduces the adverse impact of the curse of dimensionality on clustering results. Finally, experiments on UCI data show that SPKCA achieves better clustering results when handling data sets with higher dimensions.
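
For orientation, the baseline K-prototypes algorithm that both proposed methods build on is usually described with a mixed dissimilarity: squared Euclidean distance on the numerical attributes plus a weighted count of mismatches on the categorical attributes. The sketch below illustrates that standard form only. It is not the new distance formula defined for EKPCA, which the abstract does not state. The function names, the toy data, and the weight `gamma` are assumptions made for illustration.

```python
import numpy as np


def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Baseline K-prototypes dissimilarity between one point and one prototype.

    Numerical part: squared Euclidean distance.
    Categorical part: number of mismatching attributes, weighted by gamma.
    (gamma is an illustrative assumption, not a value from the thesis.)
    """
    diff = np.asarray(x_num, dtype=float) - np.asarray(proto_num, dtype=float)
    numeric_part = float(np.sum(diff ** 2))
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part


def assign_to_prototypes(points_num, points_cat, protos_num, protos_cat, gamma=1.0):
    """Assign every point to the prototype with the smallest dissimilarity."""
    labels = []
    for x_num, x_cat in zip(points_num, points_cat):
        costs = [
            mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
            for p_num, p_cat in zip(protos_num, protos_cat)
        ]
        labels.append(int(np.argmin(costs)))
    return labels


# Toy mixed data: two numerical attributes and two categorical attributes.
points_num = [[0.0, 0.0], [5.0, 5.0]]
points_cat = [["red", "small"], ["blue", "large"]]
protos_num = [[0.0, 0.5], [5.0, 4.5]]
protos_cat = [["red", "small"], ["blue", "large"]]

labels = assign_to_prototypes(points_num, points_cat, protos_num, protos_cat)
```

A full K-prototypes run alternates this assignment step with a prototype update (mean for numerical attributes, mode for categorical ones) until the assignments stop changing. The choice of `gamma` controls how strongly categorical mismatches count relative to numerical distance.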
Keywords/Search Tags: K-prototypes, mixed data, distance calculation, initial prototype, iterative elimination, high-dimensional data, curse of dimensionality, subspace projection