The Research On Data Mining Algorithm's Parallelization Based On Hadoop2.0

Posted on:2016-08-10

Degree:Master

Type:Thesis

Country:China

Candidate:Z Qu

Full Text:PDF

GTID:2308330461957253

Subject:Control Engineering

Abstract/Summary:

Society is undergoing a major shift: industries of every kind, led by the Internet industry, are now awash in big data. Social networks, e-commerce and mobile communications in particular are bringing us into a new era of information at the petabyte (PB) scale. In an age when data is produced, shared and applied on a massive scale, an overall solution that takes cloud computing as its core and combines it with data mining, artificial intelligence and other technologies will be a powerful means of solving big data problems and discovering the value of the data.

Hadoop is an open-source distributed system maintained by the Apache Software Foundation. Its software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.

In recent years, led by the Internet giants, Hadoop has gradually been accepted, trialled and applied in many sectors, such as the Internet, finance, banking, education and government agencies. As a result, Hadoop has become the most popular cloud computing platform in the field of big data processing.

As two of the most widely used algorithms in data mining, the decision tree algorithm and the k-means algorithm can extract hidden, previously unknown and useful information and knowledge from raw data, enabling people to make better use of the immense value of that data.
In this thesis, building on an existing cloud computing platform, we design a method for parallelizing data mining algorithms on Hadoop 2.0. By porting serial mining algorithms to the Hadoop platform, we address the problem that traditional data mining techniques cannot mine massive data sets effectively.

The thesis first introduces the two technical backgrounds of the research, cloud computing and data mining, and then combines them to present the idea of parallel data mining on a cloud computing platform. Next, it studies in depth the architecture of Hadoop 2.0 and its internal implementation details. On this basis, it analyzes two types of data mining algorithm: decision tree classification and k-means clustering. Based on their optimized variants, SPRINT and Canopy respectively, it designs Hadoop 2.0-based parallelization schemes and describes their implementation steps in detail. Finally, it evaluates the performance of the Hadoop 2.0-based parallel data mining algorithms through experiments.
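
To illustrate the general idea of expressing k-means in a map/reduce style, a minimal single-machine sketch in Python follows. It is not the scheme developed in the thesis: it omits the Canopy step and does not use Hadoop at all. The function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`) are hypothetical and chosen only for this illustration; in-memory grouping stands in for the shuffle phase.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)),
                  key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points that share one centroid index."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(x / count for x in total)


def kmeans_iteration(points, centroids):
    """One k-means iteration: map every point, group by key, reduce."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    # A centroid with no assigned points is kept unchanged.
    return [kmeans_reduce(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans_iteration(pts, [(0.0, 0.0), (10.0, 10.0)]))
```

In a real cluster job the map calls would run on separate data splits and the iteration would repeat until the centroids stop moving; this sketch shows only how the assignment and update steps separate into independent map and reduce functions.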

Keywords/Search Tags:

Hadoop, Data mining, Decision tree, K-means, Parallelization

Related items

1	The Study Of Decision Tree Algorithm Based On Hadoop Platform
2	Research And Implementation Of Big Data Analysis And Mining Technology Based On Hadoop In Telecommunications Industry
3	Research On The Parallelization Of Decision Tree Algorithm Based On YARN Framework
4	The Research On Decision Tree Algorithm's Parallelization Based On Hadoop Platform
5	The Research Of Decision Tree Mining Based On Hadoop
6	K-means Based On Binary And Svm Decision Tree Algorithm Of Data Mining Research
7	The Parallel Research On Decision Tree Classification Algorithm Based On Hadoop
8	Improvement Of Decision Tree Algorithm Based On Hadoop And Research On Classification And Prediction Of Forestry Data
9	Behavior Data Mining And Analysis System For Campus Big Data
10	The Research On Classification And Regression Tree’s Parallelization Based On Spark Platform