Research On K_means Clustering Algorithm Based On MapReduce

Posted on: 2017-01-08 | Degree: Master | Type: Thesis
Country: China | Candidate: M W Zhang | Full Text: PDF
GTID: 2278330485466772 | Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet applications, ever more data is being accumulated. Extracting valuable information from this massive data efficiently and applying it to the relevant fields has become an urgent problem. To address it, researchers have proposed a growing number of clustering analysis algorithms, and clustering analysis is now widely used in fields such as finance, the military, medicine, and management.

The K_means clustering algorithm is widely used in clustering analysis because its idea is simple and it is easy to implement. However, randomly initialized centers make the clustering results unstable and prone to local optima. In addition, isolated points in the data set distort the clustering results. As the volume of data to be clustered grows, the number of iterations K_means requires increases and the algorithm becomes very time-consuming, so the traditional stand-alone mode of operation can no longer meet current needs. MapReduce, as implemented on the Hadoop platform, is a widely used distributed computing framework, and HDFS provides distributed file storage, so clustering algorithms designed for a single machine can be ported to Hadoop and run as distributed clustering tasks. To address the shortcomings of K_means, this thesis proposes an optimized K_means algorithm and implements it in parallel.

First, the thesis reviews the background of clustering analysis and the state of research in China and abroad, and outlines its main work and contributions. Second, it introduces the measures used in clustering analysis and the categories of clustering algorithms, and introduces Hadoop through the HDFS distributed file system and the MapReduce programming model. Then, to address the problems of K_means, it proposes, based on the maximum distance method, an improved initial center selection algorithm and an isolated point exclusion method. The improved K_means algorithm is adapted to the MapReduce programming model so that it runs on the Hadoop platform. Finally, comparative experiments in a stand-alone environment are used to demonstrate the algorithm's clustering quality, and experiments in a parallel environment use the speedup ratio and the scaleup ratio to show that the parallel algorithm is suitable for big data problems.
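The abstract does not give the details of the improved algorithm, so the following is only a minimal single-process sketch of the general idea, under stated assumptions. The isolation rule in `drop_isolated` (and its `factor` threshold), the farthest-first rule in `max_distance_centers`, and the split into `map_phase` and `reduce_phase` are all hypothetical illustrations, not the thesis's actual method. On Hadoop, the map and reduce steps would run as distributed tasks over data stored in HDFS rather than as local function calls.

```python
import math
from collections import defaultdict

def drop_isolated(points, factor=2.0):
    """Assumed rule: drop points whose mean distance to all other points
    exceeds `factor` times the overall mean of those per-point means."""
    n = len(points)
    if n < 3:
        return list(points)
    avg = [sum(math.dist(p, q) for q in points) / (n - 1) for p in points]
    overall = sum(avg) / n
    return [p for p, a in zip(points, avg) if a <= factor * overall]

def max_distance_centers(points, k):
    """Assumed farthest-first selection: start at the point nearest the data
    mean, then repeatedly add the point farthest from the chosen centers."""
    dim = len(points[0])
    mean = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
    centers = [min(points, key=lambda p: math.dist(p, mean))]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers

def map_phase(points, centers):
    """Map step: emit (index of nearest center, point) for every point."""
    for p in points:
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        yield idx, p

def reduce_phase(pairs, centers):
    """Reduce step: average the points assigned to each center index.
    A center that received no points keeps its previous position."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    updated = list(centers)
    for idx, members in groups.items():
        updated[idx] = tuple(sum(col) / len(members) for col in zip(*members))
    return updated

def kmeans(points, k, max_iter=100, tol=1e-9):
    """Run map/reduce rounds until no center moves by more than `tol`."""
    clean = drop_isolated(points)
    centers = max_distance_centers(clean, k)
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(clean, centers), centers)
        if all(math.dist(a, b) <= tol for a, b in zip(centers, updated)):
            break
        centers = updated
    return centers

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11),
            (100, 100)]  # the last point acts as an isolated point
    print(sorted(kmeans(data, k=2)))
```

Because the initial centers are chosen deterministically from the data instead of at random, repeated runs on the same input give the same result, which is the stability problem the abstract attributes to random initialization.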
Keywords/Search Tags: Clustering Analysis, K_means, Hadoop, MapReduce, HDFS