| With the rapid development of computer, communication, and Internet technologies, the amount of data in the world has grown explosively. Extracting valuable information from massive data is a difficult problem in the field of data mining. Cloud computing, with its powerful computing capability and huge storage capacity, offers a new approach to mining massive data. Hadoop, a cloud computing solution from the Apache Foundation modeled on Google's cloud technology, is currently the most widely used cloud computing platform and has the advantages of low cost, high availability, high reliability, and scalability. The decision tree is one of the easiest to understand and most commonly used models in data mining. However, traditional decision tree algorithms run on a single machine and are therefore limited by its CPU and memory. This paper introduces the Hadoop cloud computing platform and discusses its two key technologies, HDFS and MapReduce. The C4.5 and SPRINT decision tree algorithms are selected as the research objects. First, the C4.5 algorithm is improved with a new selection method based on a two-layer information gain ratio (the D-C4.5 algorithm), and a parallel design of the improved algorithm is given. Second, to address the multi-value bias of the Gini index in the SPRINT algorithm, a new method for computing a two-layer Gini index is proposed (the D-SPRINT algorithm), together with its parallel design. To further improve the accuracy of the decision tree, this paper proposes a new method for selecting the splitting attribute of a node (the D-CS algorithm) that combines the D-C4.5 and D-SPRINT algorithms. A parallel design of D-CS is also given so that the algorithm can be implemented more effectively on the Hadoop platform. Finally, experiments show that the D-C4.5 and D-SPRINT algorithms are more accurate than the unmodified algorithms and that their parallel versions run faster. The D-CS algorithm is more accurate than both D-C4.5 and D-SPRINT, and the parallel HD-CS algorithm achieves a higher speed-up ratio and is better suited to processing massive data. |
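
For orientation, the two baseline splitting criteria named above can be sketched in a few lines of Python. This is only an illustration of the standard, single-layer information gain ratio (C4.5) and Gini index; the two-layer variants used by D-C4.5, D-SPRINT, and D-CS are not defined in this abstract and are not reproduced here. All function names are hypothetical, and the Gini function assumes a simple multiway split on a categorical attribute rather than SPRINT's actual split search.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in a list."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def partition(values, labels):
    """Group class labels by the attribute value of each record."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def gain_ratio(values, labels):
    """Standard C4.5 criterion: information gain divided by split information."""
    total = len(labels)
    groups = partition(values, labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the attribute's own values
    return gain / split_info if split_info > 0 else 0.0


def gini_split(values, labels):
    """Weighted Gini impurity after a multiway split (lower is better)."""
    total = len(labels)
    return sum(len(g) / total * gini(g) for g in partition(values, labels))


# Toy data (made up for illustration): an ID-like attribute with one value
# per record versus a two-valued attribute.
labels = ["yes", "yes", "no", "no", "yes", "no"]
record_id = [1, 2, 3, 4, 5, 6]
outlook = ["sunny", "sunny", "sunny", "rain", "rain", "rain"]

print("gini_split(record_id):", gini_split(record_id, labels))
print("gini_split(outlook):  ", gini_split(outlook, labels))
print("gain_ratio(record_id):", gain_ratio(record_id, labels))
print("gain_ratio(outlook):  ", gain_ratio(outlook, labels))
```

On this toy data the ID-like attribute produces only single-record partitions, so its weighted Gini impurity is zero even though it has no predictive value. That is the multi-value bias the abstract refers to.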