Research On K_means Clustering Algorithm Based On MapReduce

Posted on:2017-01-08

Degree:Master

Type:Thesis

Country:China

Candidate:M W Zhang

Full Text:PDF

GTID:2278330485466772

Subject:Computer application technology

Abstract/Summary:

With the rapid development of internet applications, more and more data is being accumulated. How to extract valuable information from massive data efficiently and quickly, and apply it to the relevant fields, has become an urgent problem. To address it, researchers have proposed a growing number of clustering analysis algorithms, and clustering analysis is now widely used in fields such as finance, the military, medicine, and management.

The K_means clustering algorithm is widely used in clustering analysis because its idea is simple and it is easy to implement. However, random initialization of the cluster centers makes the clustering results unstable and prone to local optima. In addition, isolated points in the data set distort the clustering results. As the volume of data to be clustered grows, the number of iterations of the K_means algorithm increases and the algorithm becomes very time-consuming, so the traditional stand-alone mode of operation can no longer meet current needs. MapReduce is a distributed computing model that runs on the Hadoop platform and is widely used at present. HDFS provides distributed file storage, so a clustering algorithm written for a single machine can be ported to the Hadoop platform to run distributed clustering tasks. To address the shortcomings of the K_means algorithm, this thesis proposes an optimized K_means algorithm and implements it in parallel.

First, the thesis reviews the background of clustering analysis and the state of research in China and abroad, and outlines its main work and contributions. Second, it introduces the measures used in clustering analysis and the categories of clustering algorithms, and introduces Hadoop through the HDFS distributed file system and the MapReduce programming model. Third, to address the problems of the K_means algorithm, it proposes an improved initial center selection algorithm and a method for excluding isolated points, based on the maximum distance method. The improved K_means algorithm is then adapted to the MapReduce programming model so that it runs on the Hadoop platform. Finally, comparative experiments in a stand-alone environment demonstrate the clustering quality of the algorithm, and experiments in a parallel environment use the speedup and scaleup ratios to show that the parallel algorithm is suitable for processing big data.
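
The ideas named in the abstract (maximum-distance selection of initial centers, exclusion of isolated points, and a map/reduce formulation of one K_means iteration) can be illustrated with a minimal single-process Python sketch. The abstract does not state the thesis's exact rules, so everything here is an assumption: the function names are hypothetical, the isolated-point test (no neighbor within a fixed `radius`) is a simple stand-in, the seeding is the generic farthest-first heuristic, and the map and reduce steps are plain functions rather than Hadoop jobs.

```python
import math
from collections import defaultdict

def drop_isolated(points, radius):
    """Assumed rule: keep a point only if another point lies within `radius`."""
    return [p for i, p in enumerate(points)
            if any(i != j and math.dist(p, q) <= radius
                   for j, q in enumerate(points))]

def max_distance_centers(points, k):
    """Farthest-first seeding: each new center is the point whose
    distance to its nearest already-chosen center is largest."""
    centers = [points[0]]  # assumption: the first point seeds the process
    while len(centers) < k:
        centers.append(max(
            points,
            key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def map_assign(point, centers):
    """Map step: emit (index of nearest center, (point, 1))."""
    idx = min(range(len(centers)),
              key=lambda i: math.dist(point, centers[i]))
    return idx, (point, 1)

def reduce_mean(pairs):
    """Reduce step: average all points emitted for one center index."""
    total, count = None, 0
    for point, n in pairs:
        if total is None:
            total = list(point)
        else:
            total = [t + x for t, x in zip(total, point)]
        count += n
    return tuple(t / count for t in total)

def kmeans_iteration(points, centers):
    """One K_means iteration expressed as map, group-by-key, reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centers)
        groups[key].append(value)
    # A center that received no points keeps its previous position.
    return [reduce_mean(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))]
```

In this sketch the isolated points are dropped before seeding, because farthest-first selection would otherwise tend to pick outliers as centers. On a real Hadoop cluster the grouping by key would be done by the framework's shuffle phase, and the iteration would be repeated as successive jobs until the centers stop moving.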

Keywords/Search Tags:

Clustering Analysis, K_means, Hadoop, MapReduce, HDFS

Related items

1	Parallel Clustering Algorithm Based On MapReduce
2	The Performance Optimization And Improvement Of MapReduce In Hadoop
3	Meticulous Analysis Based On Hadoop And Its Application
4	Research Of Clustering Algorithm Based On Mahout
5	Design And Implementation Of Hadoop-Based Network Traffic Analysis System
6	Research On Big Data Text Analysis Based On Hadoop Architecture
7	Study On Iterative Mapreduce Computation Model For Clustering Analysis
8	Working Principle And Applied Research Of MapReduce
9	MapReduce Performance Research And Optimization Based On Block Aggregation
10	The Research And Application Of Security Log Clustering Mining Algorithm Based On Hadoop Platform