Research And Implementation Of An Improved Distributed Grid Clustering Algorithm

Posted on:2023-12-06

Degree:Master

Type:Thesis

Country:China

Candidate:H Y Wang

GTID:2568306620456024

Subject:Software engineering

Abstract/Summary:

As an important branch of data mining, clustering is widely used in image recognition, natural language processing, recommendation systems, correlation analysis, and many other fields. Traditional clustering algorithms mainly mine information on a single machine. As data volumes keep growing, this approach is limited by processor performance, memory, and other resources, and it struggles to meet the clustering requirements of today's massive data sets. Distributed computing frameworks offer a relatively effective alternative to traditional stand-alone clustering: they pool the computing and storage capacity of a unified cluster of machines, so that clustering is no longer constrained by data size or by the resources of one machine.

However, existing distributed clustering algorithms often have two problems in high-dimensional, massive-data scenarios: first, the final clustering result has low accuracy on high-dimensional data; second, on large-scale data the algorithms take a long time on average and are inefficient. To address these problems, this thesis combines data dimension reduction, grid-based clustering, and the Spark distributed platform to propose a distributed clustering algorithm based on adaptive meshing. The algorithm can handle high-dimensional data, and it achieves good clustering results and high efficiency on both standard and real data sets. In addition, a complete urban hotspot mining system has been built, which can effectively mine hotspots in Kunming from taxi GPS tracks and Weibo check-in data.

The main work of this thesis is as follows:

(1) For high-dimensional data sets such as images and texts, a dimension reduction method based on the decision graph and linear discriminant analysis is designed and implemented to preprocess the data as a whole. The high-dimensional data is mapped to a low-dimensional space, which effectively reduces the subsequent computation.

(2) An improved distributed grid clustering algorithm is proposed, based on the Spark distributed computing platform, adaptive grid partitioning, and multi-stage cell allocation. The algorithm handles data sets of varied forms and dimensionality well.

(3) A complete automated system for mining urban hotspots is built, including data preprocessing, feature engineering, data mining, and data visualization modules. The system is managed with Docker containers, which makes it easy to scale and port.

The research results of this thesis can be applied to urban planning, correlation analysis, and other fields, helping to address the difficulty of mining useful information from high-dimensional massive data. The work therefore has good research value and application prospects.
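As a rough illustration of the general grid clustering idea the abstract builds on, the following is a minimal single-machine sketch in Python. It is not the thesis's algorithm: it uses a fixed cell size rather than adaptive grid partitioning, it does not run on Spark, and it has no multi-stage cell allocation. The function name `grid_cluster` and the parameters `cell_size` and `min_points` are assumptions introduced only for this example.

```python
from collections import defaultdict, deque
from itertools import product
from math import floor


def grid_cluster(points, cell_size, min_points):
    """Label each point with a cluster id, or -1 if it lies in a sparse cell.

    Hypothetical helper for illustration: fixed-size cells, single machine.
    """
    # 1. Assign every point to the grid cell that contains it.
    cells = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(floor(x / cell_size) for x in point)
        cells[key].append(idx)

    # 2. Keep only "dense" cells, i.e. those with at least min_points points.
    dense = {key for key, members in cells.items() if len(members) >= min_points}

    # 3. Merge adjacent dense cells into clusters with a breadth-first search.
    labels = [-1] * len(points)
    seen = set()
    cluster_id = 0
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for idx in cells[cell]:
                labels[idx] = cluster_id
            # Visit every neighboring cell, including diagonal neighbors.
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbor = tuple(c + o for c, o in zip(cell, offset))
                if neighbor in dense and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        cluster_id += 1
    return labels


# Made-up toy data: two groups of points plus one isolated point.
points = [
    (0.1, 0.1), (0.2, 0.3), (0.4, 0.2),
    (1.1, 0.2), (1.3, 0.4), (1.2, 0.1),
    (5.1, 5.2), (5.3, 5.4), (5.2, 5.1),
    (9.5, 0.5),
]
print(grid_cluster(points, cell_size=1.0, min_points=3))
```

The neighbor search in step 3 visits 3^d cells for d-dimensional data, so a scheme like this is only practical in low dimensions. That is consistent with the abstract's choice to reduce dimensionality before clustering.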

Keywords/Search Tags:

Clustering, High-dimensional mass data, Adaptive meshing, Spark platform, Multistage allocation

Related items

1	Research On Three-dimensional Point Cloud Data Meshing Technology
2	Optimization And Implementation Of Clustering Algorithms Based On Spark Platform
3	Research On Load Allocation Strategy Based On Data Clustering
4	Research And Implementation Of Clustering Method For High Dimensional Categorical Data
5	Research And Application Of Big Data Clustering Algorithm Based On Spark Platform
6	Research On Clustering Algorithm Based On Subspace In High-dimensional Data Streams
7	Research And Application Of Clustering Algorithm Based On Spark Platform
8	The Researches On Related To Key Technologies Among Clustering Based On High-dimensional Data Space
9	Research And Application Of Rough Clustering Algorithm For High Dimensional Data Sets
10	Research On Clustering Algorithms For High-Dimensional Data