Parallel Division Clustering Optimization Algorithm Based On Spark

Posted on:2024-08-02

Degree:Master

Type:Thesis

Country:China

Candidate:D J Gan

Full Text:PDF

GTID:2568307124471644

Subject:Computer technology

Abstract/Summary:


Clustering is an unsupervised learning technique that divides a data set into categories according to similar characteristics of the data. Objects in the same category are similar to one another, while objects in different categories differ greatly, which reveals the underlying distribution pattern of the sample data. Among clustering algorithms, K-Means and FCM are the most representative partition clustering algorithms and have attracted wide attention for their simple clustering idea, good flexibility, and high feasibility.

With the advent of 5G, data scale has exploded. Compared with traditional data, big data is characterized by large scale, diverse data types, low value density, and fast growth. However, the time complexity of traditional partition clustering algorithms increases geometrically when dealing with big data, so how to make partition clustering algorithms process big data faster has become a focus of attention at home and abroad. Large, high-dimensional data sets therefore pose a great challenge to traditional partition clustering algorithms. Parallel partition clustering algorithms have been proposed to reduce the high cost of the traditional algorithms, but the following problems remain when dealing with big data: (1) how to improve the clustering effect of partition clustering algorithms in the big data environment; (2) how to effectively improve the computing efficiency of partition clustering algorithms in the big data environment; (3) how to effectively improve the parallelization performance of partition clustering algorithms under Spark.

In view of these problems, and on the basis of studying partition clustering algorithms, Spark, and related knowledge, this thesis addresses the existing problems of the K-Means and FCM algorithms respectively and proposes: (1) a parallel division clustering algorithm based on Spark and ASPSO, named PDC-SFASPSO; (2) a parallel FCM clustering algorithm based on Spark and FA, named SP-PFCM. The main research work on these two improved algorithms is as follows.

(1) Parallel K-Means partition clustering algorithm based on Spark and ASPSO. A parallel partition clustering algorithm based on Spark and the ASPSO strategy, PDC-SFASPSO, is proposed to solve the problems of a large data dispersion coefficient and poor anti-interference, difficulty in determining the number of local clusters, randomness of local cluster centroids, and low efficiency of parallel merging of local clusters. Firstly, a grid partitioning strategy based on PCCV is proposed to obtain grid cells with a small data dispersion coefficient and to filter outliers, which reduces the data dispersion coefficient and enhances the anti-interference performance of the algorithm. Secondly, a meshing strategy based on PFGF is proposed to obtain the number of local clusters. Thirdly, the ASPSO strategy, based on particle swarm optimization, is proposed to obtain the local cluster centroids and remove the randomness of their selection. Finally, a clustering strategy based on cluster radius and neighbor node (CRNN) is proposed to merge highly similar clusters in parallel, which improves the efficiency of parallel merging of local clusters. The experimental results show that the PDC-SFASPSO algorithm performs well in data partitioning and clustering in the big data environment and is suitable for parallel clustering of large-scale data sets.

(2) Parallel FCM clustering algorithm based on Spark and FA. A parallel FCM clustering algorithm based on Spark and FA, named SP-PFCM, is proposed to solve the problems of parallel FCM clustering in the big data environment, such as excessive data feature redundancy, unbalanced subspace node load, sensitivity to the initial cluster centers, and low efficiency of parallel merging of local clusters. Firstly, a space partitioning strategy (BSMB) based on multidimensional scaling analysis and a BSP tree is proposed. Multidimensional scaling analysis extracts highly informative data features and avoids excessive feature redundancy, and the BSP tree partitions the data space so that the load of the subspace nodes is balanced. Then, an initialization strategy (ISFA) based on the FA optimization algorithm is proposed: preselected points for the cluster centers are obtained and then iterated with the FA optimization algorithm to produce the initial cluster centers, which addresses the algorithm's sensitivity to the initial cluster centers. Finally, a local cluster merging strategy (LCS) based on cluster similarity is proposed to obtain the intersection nodes between clusters, and a cluster similarity measurement function (CS) is designed to judge the intersection nodes. Combined with the Spark parallel computing framework, the data clusters are merged in parallel, which improves the efficiency of parallel merging of local clusters. The experimental results show that the SP-PFCM algorithm achieves good clustering quality and performance for parallel FCM clustering in the big data environment and is suitable for parallel clustering of large-scale data sets.
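To illustrate why partition clustering lends itself to parallelization, here is a minimal sketch of one partition-parallel K-Means iteration in plain Python. It is not the PDC-SFASPSO or SP-PFCM algorithm and does not use Spark itself. The function names (map_partition, merge_stats, kmeans_step, kmeans) are hypothetical, and the map and reduce steps only imitate the per-partition and merge phases that a framework such as Spark would distribute across nodes.

```python
import math
from functools import reduce


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def map_partition(partition, centroids):
    """Map side (assumed): per-partition coordinate sums and counts per cluster."""
    stats = {}
    for p in partition:
        k = nearest(p, centroids)
        s, n = stats.get(k, ([0.0] * len(p), 0))
        stats[k] = ([a + b for a, b in zip(s, p)], n + 1)
    return stats


def merge_stats(left, right):
    """Reduce side (assumed): combine the partial sums of two partitions."""
    merged = dict(left)
    for k, (s, n) in right.items():
        if k in merged:
            s0, n0 = merged[k]
            merged[k] = ([a + b for a, b in zip(s0, s)], n0 + n)
        else:
            merged[k] = (s, n)
    return merged


def kmeans_step(partitions, centroids):
    """One iteration: map every partition, reduce, then recompute centroids."""
    total = reduce(merge_stats, (map_partition(p, centroids) for p in partitions), {})
    new_centroids = []
    for k, c in enumerate(centroids):
        if k in total:
            s, n = total[k]
            new_centroids.append(tuple(x / n for x in s))
        else:
            new_centroids.append(tuple(c))  # empty cluster: keep old centroid
    return new_centroids


def kmeans(partitions, centroids, max_iter=20, tol=1e-9):
    """Repeat kmeans_step until the largest centroid shift is at most tol."""
    for _ in range(max_iter):
        new_centroids = kmeans_step(partitions, centroids)
        shift = max(math.dist(a, b) for a, b in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids


if __name__ == "__main__":
    # Three "partitions" standing in for data blocks held on different nodes.
    parts = [
        [(0.0, 0.0), (0.0, 1.0)],
        [(10.0, 10.0), (10.0, 11.0)],
        [(1.0, 0.0), (11.0, 10.0)],
    ]
    print(kmeans(parts, [(0.0, 0.0), (10.0, 10.0)]))
```

Only the small per-cluster sums and counts cross partition boundaries, never the raw points. The sketch starts from fixed initial centroids. How those centers are chosen is the part the abstract assigns to strategies such as ASPSO and ISFA, and it is left out here.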

Keywords/Search Tags:

Spark framework, parallel division clustering, K-Means, ASPSO


Related items

1	Research On Optimization And Parallel Of K-means Algorithm On Spark
2	Optimization And Application Of K-means Clustering Algorithm Based On Spark Framework
3	The Research On Parallel Computing Technology In Precise Agricultural Climate Division
4	Research On Spark Oriented Fuzzy C-means Clustering Algorithm
5	Research On Parallel Clustering Algorithm For Large - Scale Data Set
6	Parallelizing K-means-based Clustering On Spark
7	Research Of The Clustering Algorithm Based On The Spark
8	Research And Improvement Of Big Data Parallel Clustering Algorithm Based On Spark
9	Research On Density Peak-based Clustering Algorithm And Its Parallel Implementation
10	Optimization Of K-means Clustering Algorithm And Its Implementation On Spark Streaming