Research On K-prototypes Clustering Algorithm And Data Dimension Reduction

Posted on:2022-06-29

Degree:Master

Type:Thesis

Country:China

Candidate:J C Gu

Full Text:PDF

GTID:2507306509469784

Subject:Statistics

Abstract/Summary:


With the development of the mobile Internet and information technology, massive data has become the norm in every industry and field, which places higher demands on the methods used to analyze enterprise and big data. Using machine learning and other new technologies to explore the extensive and diverse relationships within such data has therefore become a major challenge in big data analysis. In statistical machine learning, clustering is one of the most common and effective methods for handling big data. The classic clustering algorithms include K-means, K-modes and K-prototypes. K-prototypes is a widely used clustering algorithm, but because of its limitations its clustering results are inaccurate in many cases. To address the inaccuracy of K-prototypes on mixed data, an enhanced K-prototypes mixed data clustering algorithm (EKPCA) is proposed. First, a new distance calculation formula is defined to enlarge the differences between data points, which helps to assign data at cluster edges reasonably. Second, more initial prototypes are selected so that they cover the overall information of the data. Finally, redundant prototypes are eliminated iteratively to obtain the true partition of the data set. The algorithm is evaluated on 8 UCI datasets, and the experimental results show that EKPCA achieves high clustering accuracy.

The era of big data is characterized not only by the huge volume of data but also by the growth of data dimensionality. In the study of high-dimensional data, the biggest problem is this growth in dimension, often called the "dimension disaster". Research shows that as the dimension increases, the complexity of processing high-dimensional data and the number of samples required both grow exponentially. Moreover, many existing clustering methods are applicable only to low-dimensional data; on high-dimensional data, both the efficiency and the quality of clustering decrease. It is therefore imperative to reduce the dimension of the data. In view of the shortcomings of the K-prototypes clustering algorithm in processing high-dimensional data, this paper proposes the Subspace Projection K-prototypes Clustering Algorithm (SPKCA). The algorithm introduces the idea of subspace projection: for high-dimensional mixed data, a subspace is first found, the dissimilarity between data points is projected onto that subspace, and the projected data is then clustered. SPKCA effectively reduces the adverse impact of the "dimension disaster" on clustering results. Finally, experiments on UCI data show that SPKCA achieves a better clustering effect on data sets with higher dimensions.
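For context, the baseline that both proposed algorithms build on can be sketched as follows. This is the standard K-prototypes dissimilarity (squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches), not the new distance formula of EKPCA, which the abstract does not state. The function names, the `gamma` weight and the toy records are assumptions made only for illustration.

```python
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Standard K-prototypes dissimilarity between a record and a prototype.

    x_num / proto_num: numeric attribute values (same length).
    x_cat / proto_cat: categorical attribute values (same length).
    gamma: assumed weight balancing the categorical part against the numeric part.
    """
    # Numeric part: squared Euclidean distance.
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    # Categorical part: simple matching, counting attributes that differ.
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


def assign_to_nearest(records, prototypes, gamma=1.0):
    """Assign each (numeric, categorical) record to its closest prototype index."""
    labels = []
    for x_num, x_cat in records:
        costs = [
            mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
            for p_num, p_cat in prototypes
        ]
        labels.append(costs.index(min(costs)))
    return labels


# Hypothetical toy data: two numeric attributes and one categorical attribute.
records = [
    ((0.0, 0.0), ("red",)),
    ((5.0, 5.0), ("blue",)),
    ((0.5, 0.0), ("blue",)),  # numerically near prototype 0, categorically like prototype 1
]
prototypes = [((0.0, 0.0), ("red",)), ((5.0, 5.0), ("blue",))]

labels_low_gamma = assign_to_nearest(records, prototypes, gamma=1.0)
labels_high_gamma = assign_to_nearest(records, prototypes, gamma=100.0)
```

The third record sits near a cluster edge: with a small `gamma` the numeric part dominates and it joins prototype 0, while a large `gamma` lets the categorical mismatch dominate and moves it to prototype 1. This sensitivity is one way to see why the handling of edge data depends on how the distance is defined.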

Keywords/Search Tags:

K-prototypes, mixed data, distance calculation, initial prototype, eliminated iteratively, high dimensional data, dimension disaster, subspace projection


Related items

1	Model-Based High-Dimensional Data Clustering Methods:A Review
2	Research On High Dimensional Imbalanced Data Classification In The Identification Of Risk User
3	Interaction Screening Of Ultra-high Dimensional Data Based On Distance Correlation
4	Feature Screening Via Distance Correlation For High-dimensional Interval-censored Data
5	Research On Mixed Data Clustering Algorithm Based On Information Entropy To Define Attribute Weights
6	Research On Calculation Of Statistical Depth Function Under Big Data
7	Research On Fast Calculation Method Of Disaster-affected Population Based On Mobile Phone Signaling Data
8	Variable Selection Of High Dimensional Models With Longitudinal Data
9	Feature Screening For Ultrahigh-Dimensional Survival Data And Outlier Detection
10	Inverse Distance Weighted Support Vector Machine On High-Dimension Low-Sample Size Data And Class-Imbalance Data