| As an important branch of data mining, clustering is widely used in image recognition, natural language processing, recommendation systems, correlation analysis, and many other fields. Traditional clustering algorithms mainly mine useful information on a single machine. As data volumes keep growing, this approach is constrained by the machine's processing performance and memory, and it struggles to meet the clustering requirements of today's massive data sets. Distributed computing frameworks offer a relatively effective remedy for traditional stand-alone clustering algorithms: they pool the computing and storage capacity of a unified cluster, so that clustering is no longer limited by data size or the computing resources of one machine. However, existing distributed clustering algorithms often show two problems in high-dimensional, massive-data scenarios: first, the final clustering results have low accuracy on high-dimensional data; second, on large-scale data the algorithms have a long average running time and are inefficient. To address these problems, this thesis combines data dimension reduction, the idea of grid-based clustering, and the Spark distributed platform to propose a distributed clustering algorithm based on adaptive grid partitioning. The algorithm can handle high-dimensional data and achieves good clustering results and high efficiency on both standard and real data sets. In addition, a complete urban hotspot mining system has been built, which can effectively mine hotspots in Kunming from taxi GPS tracks and Weibo check-in data. The main work of this thesis is as follows: (1) For high-dimensional data sets such as images and texts, a dimension reduction method based on the decision graph and linear discriminant analysis is designed and implemented to preprocess the data; mapping the high-dimensional data to a low-dimensional space effectively reduces subsequent computation. (2) Based on the Spark distributed computing platform, adaptive grid partitioning, and multi-stage cell allocation, an improved distributed grid clustering algorithm is proposed. The algorithm handles data sets of varied shapes and dimensionality well. (3) A complete automated system for mining urban hotspots is built, comprising data preprocessing, feature engineering, data mining, and data visualization modules. The system is managed with Docker containers, which makes it easy to extend and port. The results of this thesis can be applied to urban planning, correlation analysis, and other fields, helping to address the difficulty of mining useful information from high-dimensional, massive data, and therefore have good research value and application prospects. |