Research on the Parallelization of Data Mining Algorithms Based on Hadoop 2.0

Posted on: 2016-08-10
Degree: Master
Type: Thesis
Country: China
Candidate: Z Qu
Full Text: PDF
GTID: 2308330461957253
Subject: Control Engineering
Abstract/Summary:
Society is undergoing a major shift: industries of every kind, led by the internet sector, are being flooded with big data. Social networks, e-commerce and mobile communications in particular are bringing us into an era of petabyte-scale (PB) data. In an age when data is produced, shared and applied on a massive scale, a comprehensive solution that takes cloud computing as its core and combines it with data mining, artificial intelligence and other technologies will be a powerful means of solving big data problems and uncovering the value in data.
Hadoop is an open-source distributed system maintained by the Apache Software Foundation. Its software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so it delivers a highly available service on top of a cluster of computers, each of which may be prone to failure.
In recent years, following the lead of the large internet companies, Hadoop has gradually been adopted in many sectors, such as the internet, finance, banking, education and government agencies. As a result, Hadoop has become the most popular cloud computing platform in the big data processing field.
As two of the most widely used algorithms in data mining, the decision tree algorithm and the k-means algorithm can extract hidden, previously unknown and useful information and knowledge from raw data, allowing people to make better use of the value that data contains.
Building on an existing cloud computing platform, this thesis designs a method for parallelizing data mining algorithms on Hadoop 2.0. Porting serial mining algorithms to the Hadoop platform addresses the inability of traditional data mining techniques to mine massive data sets effectively.
First, the thesis introduces the two technical backgrounds of the research, cloud computing and data mining, and then combines them to present the idea of parallel data mining algorithms based on a cloud computing platform. Next, it studies the architecture of Hadoop 2.0 and its internal implementation details. On this basis, it analyzes two types of data mining algorithm, decision tree classification and k-means clustering. Starting from their optimized variants, SPRINT and Canopy respectively, it designs Hadoop 2.0-based parallelization schemes and describes their implementation steps in detail. Finally, it evaluates the performance of the Hadoop 2.0-based parallel data mining algorithms through experiments.
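The abstract does not describe the parallel schemes in detail. As a rough illustration only, the sketch below shows how one k-means iteration is commonly split into map, combine and reduce steps, with a simplified single-threshold canopy pass choosing the initial centers. It is plain Python that simulates the MapReduce flow in memory, not Hadoop code and not the thesis's actual implementation. All function names and the threshold `t2` are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def canopy_centers(points, t2):
    """Simplified canopy seeding (assumption: a single threshold t2).

    A point becomes a new center if it lies farther than t2
    from every center chosen so far.
    """
    centers = []
    for p in points:
        if all(dist(p, c) > t2 for c in centers):
            centers.append(p)
    return centers


def kmeans_map(point, centers):
    """Map step: emit (index of nearest center, (point, 1))."""
    idx = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return idx, (point, 1)


def kmeans_combine(values):
    """Combine step: sum coordinate vectors and counts for one key."""
    total, count = None, 0
    for vec, n in values:
        if total is None:
            total = list(vec)
        else:
            total = [a + b for a, b in zip(total, vec)]
        count += n
    return tuple(total), count


def kmeans_reduce(values):
    """Reduce step: merge partial sums and return the new center (the mean)."""
    total, count = kmeans_combine(values)
    return tuple(x / count for x in total)


def kmeans_iteration(splits, centers):
    """Simulate one MapReduce round over several input splits."""
    shuffled = defaultdict(list)
    for split in splits:  # each split stands in for one mapper
        local = defaultdict(list)
        for p in split:
            key, value = kmeans_map(p, centers)
            local[key].append(value)
        for key, values in local.items():  # combiner runs per mapper
            shuffled[key].append(kmeans_combine(values))
    new_centers = list(centers)  # a center with no points stays unchanged
    for key, values in shuffled.items():
        new_centers[key] = kmeans_reduce(values)
    return new_centers


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (10, 10), (10, 11)]
    splits = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]
    centers = canopy_centers(points, t2=3.0)
    print(kmeans_iteration(splits, centers))
```

In a real Hadoop job the driver would repeat this round, passing the updated centers to the mappers each time, until the centers stop moving. The combiner matters because it reduces the data shuffled between nodes to one partial sum per center per mapper.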
Keywords/Search Tags:Hadoop, Data mining, Decision tree, K-means, Parallelization