Research On Parallelization Of Spatial Data Mining Clustering Algorithm Based On SPARK

Posted on: 2019-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: D J Liu
GTID: 2310330548957924
Subject: Cartography and Geographic Information System
Abstract/Summary:
As spatial data mining technology continues to develop, K-MEANS, a traditional mainstream clustering algorithm, is still widely applied in the field. The advantage of the K-MEANS algorithm lies in its fast convergence, which can be accelerated further by running the algorithm in a distributed fashion, and in its ability to cope with the effect of noise points. However, the amount of data faced by spatial data mining keeps increasing. In large-scale spatial data mining, the computational difficulty brought by this growth makes the running time of the algorithm increase in proportion to the data volume. Data volume is growing faster than mining technology is developing, so processing capacity falls short of demand, and the traditional clustering process still runs serially.

To solve these problems, parallel algorithms must be implemented on existing distributed platforms, of which HADOOP is the representative. However, these solutions have three main problems: (1) the HADOOP parallel processing platform is expensive, its fault tolerance is poor, it cannot perform complex association operations, and its single distributed framework can easily cause data transmission bottlenecks; (2) although HADOOP provides HDFS, it has no corresponding dataset service, so iterative computations cannot be kept in memory and each iteration requires repeated reads and writes of the data, which reduces processing efficiency; (3) HADOOP needs a large number of jobs to perform complex calculations, and developers must manage the dependencies between those jobs themselves, which makes the processing steps cumbersome and prolongs the processing time.

Like HADOOP, SPARK is a distributed processing engine, but it uses a distributed memory abstraction for data processing and is particularly suitable for large-scale data processing. The resilient distributed dataset (RDD) is one of the foundations of the SPARK platform and its basic model for storing and operating on data. Because an RDD is partitioned and the records in it are immutable, the data can be processed directly in a distributed framework through the operations that SPARK provides on RDDs. This improves data processing efficiency and addresses the main problems of the traditional parallel platform described above.

This study is therefore based on the SPARK platform. It analyzes the basic principles and implementation of the K-MEANS algorithm in spatial data mining and, drawing on the relevant services provided by SPARK, analyzes the ideas behind and the implementation of a parallel K-MEANS algorithm oriented to spatial data mining. First, the serial algorithm is studied. Based on this, an effective parallelization scheme is designed that combines the core strengths of the SPARK platform, such as RDD and the MAPREDUCE operators, and makes full use of the hardware resources of the cluster. The implementation of the parallel K-MEANS algorithm on a cluster is studied in depth, with the YARN resource manager used for the parallel design, and the ideas and methods for parallelizing the K-MEANS algorithm on the platform are analyzed. The serial and parallel runs of the K-MEANS algorithm on the SPARK platform are compared to measure the speedup. The SPARK-based parallel K-MEANS algorithm is then applied to an analysis of the current state of economic development in Jiangxi Province. The visualized results of the parallel K-MEANS algorithm on the SPARK platform are compared with those of the serial K-MEANS algorithm on the same platform and with those of the parallel K-MEANS algorithm on the MATLAB platform, to verify the practicality of the proposed parallel algorithm.

The experiments and tests show that the parallel K-MEANS algorithm implemented on the SPARK platform runs significantly faster than the serial K-MEANS algorithm; that, with SPARK deployed on YARN, the data-parallel design of the K-MEANS algorithm effectively improves efficiency; and that the visualized results of the parallel K-MEANS algorithm on the SPARK platform are superior to those of the parallel K-MEANS algorithm on the MATLAB platform. The results have been compared and verified in the analysis of the current state of economic development in Jiangxi Province. This confirms the practicality and effectiveness of the work for that analysis and shows the advantages of the SPARK platform over other technology platforms in practical applications.
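The data-parallel idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation and does not use SPARK: plain Python lists stand in for RDD partitions, and the function names (`assign_partition`, `merge_partials`, `kmeans`), the 2-D sample points, and the initial centroids are all hypothetical. Each partition is mapped to per-cluster partial sums, the partial sums are reduced into totals, and the centroids are recomputed from those totals.

```python
from functools import reduce
from math import dist  # Python 3.8+

def assign_partition(points, centroids):
    """Map step: for one partition, accumulate (sum_x, sum_y, count) per cluster."""
    partial = {}
    for p in points:
        # Index of the nearest centroid by Euclidean distance.
        k = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        sx, sy, n = partial.get(k, (0.0, 0.0, 0))
        partial[k] = (sx + p[0], sy + p[1], n + 1)
    return partial

def merge_partials(a, b):
    """Reduce step: merge the per-cluster partial sums of two partitions."""
    merged = dict(a)
    for k, (sx, sy, n) in b.items():
        mx, my, mn = merged.get(k, (0.0, 0.0, 0))
        merged[k] = (mx + sx, my + sy, mn + n)
    return merged

def kmeans(partitions, centroids, max_iter=20, tol=1e-9):
    """Iterate map and reduce until the centroids stop moving."""
    for _ in range(max_iter):
        # On a cluster, each partition would be processed by a separate worker.
        partials = [assign_partition(part, centroids) for part in partitions]
        totals = reduce(merge_partials, partials, {})
        new_centroids = [
            (totals[k][0] / totals[k][2], totals[k][1] / totals[k][2])
            if k in totals else c  # keep a centroid that received no points
            for k, c in enumerate(centroids)
        ]
        shift = max(dist(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids

# Hypothetical sample data, already split into three partitions.
partitions = [
    [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0)],
    [(2.0, 1.0), (2.0, 2.0), (9.0, 10.0)],
    [(10.0, 9.0), (10.0, 10.0)],
]
result = kmeans(partitions, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(result)  # [(1.5, 1.5), (9.5, 9.5)]
```

Only the small per-cluster sums cross partition boundaries, never the points themselves, which is what makes the assignment step easy to distribute. In PySpark the same pattern would plausibly be expressed with RDD operations such as `mapPartitions` and `reduceByKey`, with the dataset cached in memory between iterations; that mapping is an assumption here, not something the abstract specifies.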
Keywords/Search Tags: clustering analysis, K-MEANS, SPARK, economic development analysis, parallelization