
The Analysis And Research Of Data Mining Classification Algorithm Based On Hadoop Platform

Posted on: 2017-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Li
Full Text: PDF
GTID: 2308330488497100
Subject: Software engineering
Abstract/Summary:
With the development of the Internet and related technologies, both the volume and the variety of data keep growing. Collecting, analyzing and applying this data is the main trend in current and future data development, and effective, fast and accurate data classification is the first problem to be solved. Traditional data mining classification algorithms often cannot process large-scale data quickly and effectively. Hadoop, one of the best cloud computing platforms, can process massive data efficiently, quickly and reliably.

This thesis first describes in detail the concepts behind the Hadoop platform, data mining and data classification. It then analyzes in depth three well-performing data mining classification algorithms: the support vector machine (SVM), K-nearest neighbor (KNN) and Naive Bayesian (NB) algorithms. Each has deficiencies that keep its classification results from reaching the ideal state, so the three algorithms are improved by, among other things, changing the calculation method and adding weight coefficients. By combining the advantages of the individual algorithms and avoiding their shortcomings, two fused algorithms are proposed: the SVM_KNN classification algorithm and the SVM_WNB classification algorithm. On this basis, the thesis discusses the feasibility and design of parallelizing the algorithms, and the two improved algorithms are parallelized on the Hadoop cloud computing platform so that they can handle massive data.

Finally, experiments show that, when processing massive data, the parallelized algorithms greatly improve both processing time and accuracy, and that their speed-up ratio increases gradually. It can therefore be concluded that the new algorithms are suitable for large data sets and can be expected to improve classification results significantly.
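The abstract does not give the details of the parallel design, so the following is only a minimal sketch of how a map/reduce split might look for a plain KNN classifier, together with the usual definition of speed-up ratio (serial time divided by parallel time). It is not the thesis's SVM_KNN or SVM_WNB algorithm and it does not use Hadoop itself. The function names, the toy data and the partitioning are all hypothetical, and the code assumes Python 3.8 or later for `math.dist`.

```python
import heapq
import math
from collections import Counter


def map_local_knn(partition, query, k):
    """Map step (assumed design): k nearest (distance, label) pairs in one partition."""
    scored = ((math.dist(features, query), label) for features, label in partition)
    return heapq.nsmallest(k, scored)


def reduce_vote(local_results, k):
    """Reduce step (assumed design): merge candidates, keep global k nearest, majority vote."""
    merged = heapq.nsmallest(k, (pair for result in local_results for pair in result))
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]


def speedup(serial_seconds, parallel_seconds):
    """Speed-up ratio: serial running time divided by parallel running time."""
    return serial_seconds / parallel_seconds


# Hypothetical toy training set, already split into two partitions
# (on a real cluster each partition would live on a different node).
partitions = [
    [((0.0, 0.0), "a"), ((0.1, 0.2), "a")],
    [((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((0.2, 0.1), "a")],
]
query = (0.15, 0.15)
k = 3

local = [map_local_knn(part, query, k) for part in partitions]
predicted = reduce_vote(local, k)
print(predicted)
```

Because each partition only returns its own k best candidates, the reduce step merges at most k candidates per partition regardless of how large the training set is, which is one reason this kind of split is a common fit for MapReduce-style platforms.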
Keywords/Search Tags: Data Mining, SVM_KNN algorithm, SVM_WNB algorithm, Hadoop, parallelization