The Analysis And Research Of Data Mining Classification Algorithm Based On Hadoop Platform

Posted on:2017-01-26

Degree:Master

Type:Thesis

Country:China

Candidate:Z J Li

Full Text:PDF

GTID:2308330488497100

Subject:Software engineering

Abstract/Summary:

With the development of the Internet and related technologies, both the volume and the variety of data keep growing. Collecting, analyzing, and applying this data is the main trend in current and future data development, and effective, fast, and accurate data classification is the first problem to be solved. Traditional data mining classification algorithms often cannot process large-scale data quickly and effectively. Hadoop, one of the leading cloud computing platforms, can process massive data efficiently, quickly, and reliably.

This thesis first describes in detail the Hadoop platform, data mining, and data mining classification. It then analyzes in depth three well-performing data mining classification algorithms: the support vector machine (SVM) algorithm, the K-nearest neighbor (KNN) algorithm, and the Naive Bayesian (NB) algorithm. Each has deficiencies that keep its classification results from reaching the ideal state, so the three algorithms are improved by changing the calculation method, adding weight coefficients, and similar measures. By combining the advantages of the algorithms and discarding their shortcomings, the SVM_KNN and SVM_WNB classification algorithms are proposed to solve the data processing problem. On this basis, the thesis introduces the feasibility and idea of parallelizing the algorithms, and the two improved algorithms are run in parallel on the Hadoop cloud computing platform so that they can handle very large data sets.

Finally, experiments show that, when processing massive data, the processing time and accuracy of the parallelized algorithms improve greatly, and their speed-up ratio increases gradually. It can therefore be concluded that the new algorithms can be used to deal with large data, and a significant improvement in classification effect can be expected.
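
The abstract does not spell out how the proposed algorithms are parallelized. As a rough illustration of the general idea only, the sketch below shows a map/reduce-style K-nearest neighbor classification in plain Python. It is not the thesis's SVM_KNN or SVM_WNB algorithm and does not use Hadoop itself; the function names (map_partition, reduce_neighbors, parallel_knn_classify), the Euclidean distance, and the toy data are all assumptions made for this example.

```python
import heapq
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def map_partition(partition, query, k):
    """Map step (hypothetical): k nearest (distance, label) pairs in one data partition."""
    return heapq.nsmallest(
        k, ((dist(point, query), label) for point, label in partition)
    )


def reduce_neighbors(partial_results, k):
    """Reduce step (hypothetical): merge candidates, then vote among the global k nearest."""
    merged = heapq.nsmallest(
        k, (pair for part in partial_results for pair in part)
    )
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]


def parallel_knn_classify(partitions, query, k=3):
    # On a cluster, each map_partition call would run as an independent task
    # over one data split; here the calls simply run one after another.
    partial = [map_partition(p, query, k) for p in partitions]
    return reduce_neighbors(partial, k)


# Toy training data, already split into two partitions (assumed for illustration).
partitions = [
    [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((5.0, 5.0), "b")],
    [((0.2, 0.1), "a"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b")],
]

if __name__ == "__main__":
    print(parallel_knn_classify(partitions, (0.0, 0.1)))
    print(parallel_knn_classify(partitions, (5.0, 5.0)))
```

Because each partition only reports its own k best candidates, the reduce step sees at most k pairs per partition rather than the full data set, which is what makes this kind of split attractive for large inputs.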

Keywords/Search Tags:

Data Mining, SVM_KNN algorithm, SVM_WNB algorithm, Hadoop, parallelization

Related items

1	Research On The Parallelization Of Decision Tree Algorithm Based On YARN Framework
2	The Study Of Decision Tree Algorithm Based On Hadoop Platform
3	Research On Key Technologies Of Data Mining In Manufacturing Execution System
4	Research On Parallelization Of Dynamic K-means Algorithm In Remote Sensing Image Mining
5	Research And Application Of Association Mining Algorithm In IDS In Hadoop Framework
6	Research On Parallel Data Mining Algorithm Based On Hadoop
7	The Research And Implement Of Data Mining Algorithms Based On Hadoop
8	Improvement Of Collaborative Filtering Recommendation Algorithm And Its Parallelization On Hadoop Platform
9	Medical Insurance Data Mining Based On The Hadoop Platform
10	The Research On Data Mining Algorithm’s Parallelization Based On Hadoop2.0