| In today's mobile internet era, massive amounts of data are generated every day, and these data contain great value. How to extract valuable information from them has become an important research topic. Processing data at this scale on a traditional single machine takes a long time and does not meet the needs of today's businesses. To address this problem, this paper combines a traditional single-machine clustering algorithm with the distributed computing platform Spark, and designs, implements, and optimizes a parallel version of it. CLARANS (Clustering Large Applications based upon RANdomized Search) is a widely used partition-based clustering algorithm. It is robust and insensitive to noise (outliers), and the input order of the data does not affect the clustering result. However, its time complexity is relatively high, which makes it difficult to apply to massive data. In addition, the number of clusters is difficult to determine, the clustering result depends on the choice of initial center points, and the search easily falls into a local optimum. As a result, clustering efficiency is low and the stability of the clustering is hard to guarantee. This paper first reviews the state of research on clustering algorithms at home and abroad and, on that basis, discusses research on parallelizing clustering algorithms and on optimizing clustering algorithms with genetic algorithms. It then analyzes in detail the main concepts and principles of the Spark computing framework and of the distributed file system HDFS, and summarizes the relevant background on clustering algorithms, which provides the theoretical basis for the rest of the paper. Next, to address the high complexity of the algorithm, a parallel SP-CLARANS algorithm based on the Spark platform is proposed, which uses the speed advantage of Spark's in-memory cluster computing to improve the efficiency of the algorithm and the size of the data sets it can process. To address the algorithm's sensitivity to the initial center points, which makes the global optimum difficult to reach, a novel SPGA-CLARANS algorithm based on a parallelized genetic algorithm is proposed. Chromosome encoding, selection, crossover, and mutation are designed and adapted according to the characteristics of Spark. The design combines the global search capability of the genetic algorithm with the local search capability of SP-CLARANS to improve the quality and stability of the clustering. Finally, a Spark cluster is built to perform simulation experiments. Real data sets from UCI are used to verify the accuracy and stability of the algorithm with the cluster running in standalone mode; artificial data sets of different scales are used to verify its clustering efficiency; and large data sets are used to verify its parallel performance. The experimental results show that the improved parallel algorithm presented in this paper achieves higher clustering accuracy, clustering efficiency, and parallel performance. It helps to relieve the bottleneck that traditional clustering algorithms face when processing massive data. |
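
The abstract names CLARANS without spelling out its search procedure. As a rough illustration only, a minimal single-machine sketch of the randomized medoid-swap search might look like the following. The parameter names `numlocal` and `maxneighbor` follow the usual CLARANS formulation; the helper names `dist`, `total_cost`, and `clarans` are hypothetical, Euclidean distance is an assumption, and this is not the SP-CLARANS or SPGA-CLARANS implementation summarized above.

```python
import random


def dist(a, b):
    """Euclidean distance between two points (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def clarans(points, k, numlocal=2, maxneighbor=100, seed=0):
    """Randomized search over medoid sets; requires len(points) > k.

    numlocal:    number of random restarts (local searches)
    maxneighbor: consecutive non-improving neighbors tried before
                 the current medoid set is accepted as a local optimum
    Returns (sorted medoid indices, total cost).
    """
    rng = random.Random(seed)
    n = len(points)
    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        # Start each local search from a random set of k medoids.
        current = rng.sample(range(n), k)
        current_cost = total_cost(points, current)
        tried = 0
        while tried < maxneighbor:
            # A neighbor differs from the current set in exactly one medoid.
            i = rng.randrange(k)
            candidate = rng.choice([j for j in range(n) if j not in current])
            neighbor = current[:i] + [candidate] + current[i + 1:]
            cost = total_cost(points, neighbor)
            if cost < current_cost:
                # Move to the better neighbor and restart the counter.
                current, current_cost = neighbor, cost
                tried = 0
            else:
                tried += 1
        if current_cost < best_cost:
            best, best_cost = sorted(current), current_cost
    return best, best_cost
```

In this sketch, `total_cost` is evaluated once per candidate swap and scans every point, which is where the high time complexity mentioned above comes from. That sum over points is also the natural candidate for distribution on a cluster, although the abstract does not say how SP-CLARANS actually partitions the work.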