| Real-world data sets often contain mixed data made up of numerical and categorical attributes. As a cluster analysis algorithm, K-Means is suitable only for mining and analyzing numerical data and cannot handle mixed data well. Research on mixed-data clustering led to the K-Prototypes algorithm, which can handle such data. Like K-Means, it is simple, efficient, and highly scalable, but it suffers from the random selection of initial cluster centers, the manual specification of the number of clusters, and an inaccurate measure of mixed-attribute dissimilarity. To address these shortcomings, this paper improves the K-Prototypes algorithm and integrates it into the Spark framework to strengthen its parallel computing capability on large-scale data sets. The innovative work of this paper is as follows. First, to address the problem that the mixed-attribute dissimilarity measure of K-Prototypes relies only on Euclidean distance, the morphological similarity distance (MSD) is introduced: building on the fuzzy centroid of the categorical attributes and on information entropy, MSD replaces Euclidean distance as the distance component of the mixed-attribute dissimilarity measure. Examples show that MSD partitions data better than Euclidean distance and confirm the effectiveness of the MSD-based dissimilarity formula. Second, because the random placement of the initial cluster centers makes the K-Prototypes algorithm prone to local optima and the number of clusters k must be specified manually, the improved mixed-attribute dissimilarity
measurement formula is combined with the dissimilarity-matrix idea to select the initial cluster centers. The weights of the numerical and categorical attributes are then used to optimize the internal validity index CUM, and the CUM index is computed for different values of k to select an appropriate number of clusters. Simulation experiments on the UCI data set show that the improved K-Prototypes algorithm outperforms the comparison algorithm on three external validity indices and on the improved internal validity index CUM. Third, because the improved K-Prototypes algorithm consumes substantial computing resources and runs too long when calculating mixed-attribute dissimilarity on large-scale data sets, the Apache Spark parallel computing framework is introduced and an Information-entropy and Spark Parallelized K-Prototypes (ISPK-Prototypes) algorithm is proposed. A comparative analysis of validity indices, number of worker nodes, and speedup ratio on the XX Province entrepreneurial loan user data set shows that the ISPK-Prototypes algorithm proposed in this paper outperforms the other two algorithms under the parallel computing framework and maintains good parallel computing performance. |
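
For orientation, the baseline measure that the first contribution modifies can be sketched as follows. This is a minimal illustration of the standard K-Prototypes dissimilarity (squared Euclidean distance on numerical attributes plus a weighted count of categorical mismatches), not the MSD- and entropy-based formula proposed in this paper. The function names and the weight parameter `gamma` are assumptions chosen for the example.

```python
from typing import Hashable, List, Sequence, Tuple


def k_prototypes_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[Hashable],
    proto_num: Sequence[float],
    proto_cat: Sequence[Hashable],
    gamma: float = 1.0,
) -> float:
    """Baseline mixed dissimilarity between an object and a prototype.

    Numerical part: squared Euclidean distance.
    Categorical part: number of mismatched attributes, weighted by gamma
    (gamma is an illustrative name for the numeric/categorical trade-off).
    """
    if len(x_num) != len(proto_num) or len(x_cat) != len(proto_cat):
        raise ValueError("object and prototype must have matching attribute counts")
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


def nearest_prototype(
    x_num: Sequence[float],
    x_cat: Sequence[Hashable],
    prototypes: List[Tuple[Sequence[float], Sequence[Hashable]]],
    gamma: float = 1.0,
) -> int:
    """Return the index of the prototype with the smallest dissimilarity."""
    distances = [
        k_prototypes_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return min(range(len(distances)), key=distances.__getitem__)
```

In this sketch the improvement described in the abstract would change only the numerical term and the way categorical mismatches are weighted. The assignment step in `nearest_prototype` would stay the same, and it is independent for each object, which is what makes the dissimilarity computation a natural candidate for parallel execution.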