Research On K-Prototypes Algorithm Based On Mixed Data And Implementation Of Spark Platform

Posted on:2022-02-12

Degree:Master

Type:Thesis

Country:China

Candidate:Q Wang

Full Text:PDF

GTID:2518306539981199

Subject:Computer technology

Abstract/Summary:

Real-world data sets often contain mixed data, combining numerical and categorical attributes. The K-Means clustering algorithm is suited only to purely numerical data and handles mixed data poorly. Research on mixed-data clustering led to the K-Prototypes algorithm, which can process such data. Like K-Means, it is simple, efficient, and highly scalable, but it has three weaknesses: the initial cluster centers are selected at random, the number of clusters must be specified manually, and the dissimilarity between mixed-attribute objects is measured inaccurately. To address these shortcomings, this thesis improves the K-Prototypes algorithm and integrates it into the Spark framework to strengthen its parallel computing capability on large-scale data sets. The contributions of this thesis are as follows.

First, the mixed-attribute dissimilarity measure of the K-Prototypes algorithm relies solely on Euclidean distance for its distance calculation. To address this, the measure is improved, building on the fuzzy centers of the categorical attributes and on information entropy, by using the morphological similarity distance (MSD) in place of Euclidean distance. Examples show that MSD partitions data better than Euclidean distance and that using MSD to improve the mixed-attribute dissimilarity formula is effective.

Second, the random placement of the initial centers makes the K-Prototypes algorithm prone to falling into local optima, and the number of clusters k must be specified manually. To address this, the improved mixed-attribute dissimilarity formula is combined with the dissimilarity-matrix approach to select the initial cluster centers. The weights of the numerical and categorical attributes are then used to optimize the internal validity index CUM, and the CUM index is computed for different values of k to select an appropriate number of clusters. Simulation experiments on UCI data sets show that the improved K-Prototypes algorithm outperforms the comparison algorithms on three external validity indices and on the improved internal CUM index.

Third, when the improved K-Prototypes algorithm computes mixed-attribute dissimilarities on large-scale data sets, it consumes substantial computing resources and runs for too long. To address this, the Apache Spark parallel computing framework is introduced and an Information-entropy and Spark Parallelized K-Prototypes (ISPK-Prototypes) algorithm is proposed. A comparative analysis of validity indices, number of worker nodes, and speedup ratio on the XX Province entrepreneurial loan user data set shows that the ISPK-Prototypes algorithm outperforms the other two algorithms under the parallel computing framework and maintains good parallel performance.
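
For orientation, the baseline mixed-attribute dissimilarity that the abstract sets out to improve can be sketched as follows. This is the commonly cited K-Prototypes formulation (squared Euclidean distance on the numerical attributes plus a weight gamma times the number of categorical mismatches), not the MSD-based or entropy-weighted measure proposed in the thesis, whose formulas are not given here. The function names, the toy data, and the gamma values are assumptions made for illustration only.

```python
from typing import Hashable, List, Sequence, Tuple

Prototype = Tuple[Sequence[float], Sequence[Hashable]]


def kprototypes_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[Hashable],
    proto_num: Sequence[float],
    proto_cat: Sequence[Hashable],
    gamma: float = 1.0,
) -> float:
    """Baseline K-Prototypes dissimilarity between an object and a prototype.

    Numerical part: squared Euclidean distance.
    Categorical part: count of mismatched categories, scaled by gamma.
    """
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


def nearest_prototype(
    x_num: Sequence[float],
    x_cat: Sequence[Hashable],
    prototypes: List[Prototype],
    gamma: float = 1.0,
) -> int:
    """Return the index of the prototype with the smallest dissimilarity."""
    distances = [
        kprototypes_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return min(range(len(distances)), key=distances.__getitem__)


# Hypothetical toy data: one object and two cluster prototypes.
point_num, point_cat = (1.0, 2.0), ("red", "small")
prototypes = [
    ((1.0, 4.0), ("red", "small")),   # same categories, farther numerically
    ((1.0, 2.5), ("blue", "large")),  # closer numerically, different categories
]

choice_gamma_1 = nearest_prototype(point_num, point_cat, prototypes, gamma=1.0)
choice_gamma_2 = nearest_prototype(point_num, point_cat, prototypes, gamma=2.0)
```

With this toy data the assignment flips from prototype 1 (gamma = 1.0) to prototype 0 (gamma = 2.0), which illustrates how strongly the baseline measure depends on the manually chosen balance between numerical and categorical attributes.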

Related items

1	Research On Taxi Trajectory Organization Method Based On Apache Spark
2	Temporal Query Analysis And Temporal Index Optimization Based On Apache Spark
3	OCTWAS - Online Check-pointer for Workflows on Apache Spark
4	Design And Implementation Of A Performance Modeling System On Apache Spark
5	Research And Application Of K-means++ Algorithm Based On Spark Platform
6	Enhanced Singular Collaborative Filtering Based Recommender System On Apache Spark
7	Using apache spark for scalable gene sequence analysis
8	The Research And Implementation Of Movie Recommendation System Based On Flink
9	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
10	Parallel Multi-label Classifier Chains Algorithm Using Apache Spark