| With continuing innovation in technology, evolving business models, and growing demand for large-scale data processing, how to analyze and process large data sets effectively in order to extract value became a topic of concern in the early 21st century. How to improve and extend classical algorithms so that they serve big data analysis is therefore particularly important in the big data era, and this paper is a contribution in that direction. Among the many fuzzy clustering algorithms, Fuzzy C-Means (FCM) is the most widely used. It determines the cluster assignment of each sample point by optimizing an objective function to obtain the membership degree of every sample point with respect to every cluster center, and clusters the sample data accordingly. This approach allows FCM to obtain better clustering results than other fuzzy clustering algorithms, even on data samples that are difficult to cluster. The research in this paper combines theoretical analysis with practical experiments. An ordinary single-machine environment and a Spark environment are compared in terms of application characteristics and models. The performance difference between the two architectures on iterative learning tasks is analyzed theoretically, leading to the conclusion that Spark has the advantage in iterative performance. The parallelization of the fuzzy c-means algorithm on the Spark platform is then discussed, and the algorithm is improved by exploiting features specific to Spark. The robustness of the parallelized algorithm is also improved considerably. To address the algorithm's poor clustering ability on non-linear data, a partitioning method and a feature weighting method are used so that non-linear data can be clustered effectively. Building on the FCM algorithm, the Canopy
algorithm is integrated to solve initialization problems in the algorithm, such as the initialization of the cluster centers and of the distance matrix. With these changes, the efficiency and performance of the improved FCM algorithm increase substantially, and the optimized algorithm is named SCWGIFP-FCM. To verify SCWGIFP-FCM, this paper uses the Anuran, Gesture Phase, 3D_spatial_network and MoCap Hand Postures data sets from the UCI repository as test data, compares the results with those of the traditional FCM algorithm using the PC index as the clustering quality criterion, and demonstrates the effectiveness and usability of the optimized algorithm experimentally. Having verified the quality and efficiency of the algorithm, the paper applies the optimized algorithm to airline customer data mining to solve a practical problem. |
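
For readers unfamiliar with the baseline, the following is a minimal sketch of the traditional FCM iteration that the abstract describes: memberships and cluster centers are updated alternately to minimize the membership-weighted sum of squared distances. It is not the SCWGIFP-FCM algorithm; it runs on a single machine and leaves out Spark parallelization, Canopy initialization, partitioning and feature weighting. The function names `fcm` and `partition_coefficient`, the fuzzifier `m = 2`, the stopping tolerance and the toy points are all illustrative assumptions. The sketch also assumes that the "PC index" refers to the partition coefficient commonly used to evaluate fuzzy partitions.

```python
import random


def fcm(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        s = sum(row)
        u.append([v / s for v in row])
    centers = []
    for _ in range(iters):
        # Centre update: mean of the points weighted by u_ij ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / total
                for d in range(dim)))
        # Membership update: u_ij is proportional to d_ij ** (-2 / (m - 1)).
        shift = 0.0
        for i in range(n):
            # Squared distances; the small constant avoids division by zero.
            d2 = [sum((a - b) ** 2 for a, b in zip(points[i], ctr)) + 1e-12
                  for ctr in centers]
            inv = [x ** (-1.0 / (m - 1.0)) for x in d2]
            s = sum(inv)
            new_row = [v / s for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:  # memberships have stopped changing
            break
    return centers, u


def partition_coefficient(u):
    """Mean of squared memberships; ranges from 1/c (fully fuzzy) to 1 (crisp)."""
    return sum(v * v for row in u for v in row) / len(u)


# Toy usage on two well-separated groups (hypothetical data, not from the paper).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, u = fcm(points, c=2)
pc = partition_coefficient(u)
```

A partition coefficient close to 1 indicates that most points belong almost entirely to one cluster, which is why a higher value is usually read as a crisper clustering. In a Spark setting, the two inner sums of the center update are the natural candidates for a distributed aggregation, but that design is only suggested by the abstract and is not shown here.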