Study On Optimization Of Random Forests Algorithm

Posted on: 2015-03-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z F Cao
Full Text: PDF
GTID: 1267330428960293
Subject: Statistics
Abstract/Summary:
Random forests are an ensemble classifier grounded in statistical learning theory. By combining bootstrap resampling with decision tree algorithms, the method builds a collection of tree classifiers {h_k(x), k = 1, ...} and then classifies or predicts by majority vote over that collection. Because the ensemble overcomes the performance ceiling of a single classifier, it performs well and can be applied to various kinds of classification, filtering, and prediction. The algorithm nevertheless has points that need improvement. Theoretical research addressing these deficiencies falls mainly into three areas: introducing new algorithms, integrating data preprocessing into the algorithm, and optimizing the forest construction process. Drawing on a thorough review of the relevant literature from China and abroad, this paper concentrates on the latter two.

In terms of data preprocessing, this paper presents two optimization algorithms to improve random forests.

First, because random forests handle unbalanced data poorly, this paper proposes the C_SMOTE algorithm, which draws on clustering and on the physical notion of a center of gravity to reduce the imbalance of the data set and thereby improve random forest classification performance. The SMOTE algorithm selects its "artificial" samples somewhat blindly and is prone to marginalization. C_SMOTE instead starts from the center of gravity of the negative samples and constructs "synthetic" samples purposefully, so that new negative samples tend to converge toward the center of gravity, which effectively remedies these defects of SMOTE.
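The idea of pulling synthetic samples toward a center of gravity can be sketched roughly as follows. This is only an illustration, not the dissertation's C_SMOTE: the abstract does not give the interpolation rule, so the sketch assumes each synthetic point is placed at a random position on the segment between an existing sample and the class centroid, and it omits the clustering step entirely. The names `centroid` and `centroid_oversample` are hypothetical.

```python
import random


def centroid(points):
    """Return the coordinate-wise mean (center of gravity) of a list of points."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def centroid_oversample(minority, n_new, seed=None):
    """Create n_new synthetic points between minority samples and their centroid.

    Assumption (not from the source): a synthetic point is x + u * (c - x),
    where x is a randomly chosen minority sample, c is the centroid of the
    minority class, and u is drawn uniformly from [0, 1).
    """
    rng = random.Random(seed)
    c = centroid(minority)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        u = rng.random()
        synthetic.append([xi + u * (ci - xi) for xi, ci in zip(x, c)])
    return synthetic


minority = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
new_points = centroid_oversample(minority, 5, seed=0)
```

Under this assumption every synthetic point lies no farther from the centroid than the sample it was generated from, which is one simple way to avoid the marginal samples that plain SMOTE interpolation can produce.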
C_SMOTE not only preserves the information in the original data set but also handles unbalanced data sets better, which substantially improves the classification performance of random forests on such data.

Second, random forests often use the C4.5 algorithm for node splitting, but C4.5 handles continuous variables by binary (dichotomous) discretization, whose efficiency depends on the number of distinct values the continuous variable takes: the more values there are, the longer the random forest takes to run. To address this, this paper proposes a new algorithm that reduces the number of continuous variable values, giving C4.5 a simpler data set and so improving its execution efficiency. The algorithm uses a χ² correction formula to deal with the deviation in the CHI2 series of algorithms. Using three common UCI data sets, this paper compares the new algorithm with the CHI2 series of algorithms in terms of improving random forest performance. The empirical results show that, compared with the CHI2 series, the new algorithm removes redundant information from the data set more effectively, greatly reducing the number of continuous values and thereby improving the execution efficiency of random forests.

In terms of optimizing the random forest construction process, this paper analyzes the factors affecting classification performance and, observing that different node splitting methods lead to different classification performance during forest generation, proposes a hybrid node splitting algorithm based on linear combination. The algorithm combines the node splitting functions of the C4.5 and CART algorithms into a single linear combination function. By varying the combination coefficients, it exploits the advantages of both algorithms and optimizes random forest classification performance.
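A linear combination of the two splitting criteria might look like the sketch below. The abstract does not state the exact form of the combination function, so this assumes a convex combination with a single hypothetical coefficient `alpha`, using gain ratio as the C4.5 criterion and Gini impurity decrease as the CART criterion. The function names are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def hybrid_score(labels, left, right, alpha):
    """Assumed hybrid criterion: alpha * gain ratio + (1 - alpha) * Gini decrease."""
    n = len(labels)
    wl, wr = len(left) / n, len(right) / n
    info_gain = entropy(labels) - wl * entropy(left) - wr * entropy(right)
    split_info = -sum(w * math.log2(w) for w in (wl, wr) if w > 0)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    gini_gain = gini(labels) - wl * gini(left) - wr * gini(right)
    return alpha * gain_ratio + (1 - alpha) * gini_gain


def best_threshold(values, labels, alpha=0.5):
    """Scan midpoints of one continuous feature; return (best score, threshold)."""
    pairs = sorted(zip(values, labels))
    best = (float("-inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = hybrid_score(labels, left, right, alpha)
        if score > best[0]:
            best = (score, threshold)
    return best


score, threshold = best_threshold([1, 2, 3, 4], ["a", "a", "b", "b"], alpha=0.5)
```

Setting `alpha` to 1 reduces the score to the gain ratio alone and setting it to 0 reduces it to the Gini decrease alone, so intermediate values trade off the two criteria. The scan also makes visible why fewer distinct continuous values mean fewer candidate thresholds to evaluate.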
In addition, the stability, correlation, and strength of the hybrid algorithm are analyzed in detail. First, the stability of the hybrid algorithm is tested with an analysis of variance based on the F statistic. The results show that the random forest built with the hybrid algorithm is somewhat unstable as the number of trees in the forest changes, but becomes stable once the forest contains more than 800 trees. Next, the correlation and strength of the hybrid algorithm are derived and discussed theoretically, and the average correlation and strength of the random forests are calculated. Empirical analysis then verifies that average correlation is negatively related to classification accuracy and that the average strength of the forest is positively related to classification accuracy. Compared with other algorithms, the hybrid algorithm has clear advantages in raising the average strength and lowering the average correlation of the forest, which confirms its superiority from another angle.

In the practical application of selecting a pool of high-quality stocks, the data sets contain a large number of continuous variables, and the task places high accuracy demands on the classification algorithm. The optimization algorithms proposed in this paper handle continuous variables well and improve the classification precision of random forests. Starting from a stock screening index system built on a value-growth investment strategy, this paper preprocesses the data with wavelet analysis and the COR_CHI2 algorithm and then uses random forests with the hybrid node splitting algorithm to select the high-quality stock pool, providing statistical support for investors constructing targeted portfolios.
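For reference, the strength and correlation analyzed above are commonly defined as in Breiman's original analysis of random forests. The abstract does not show the dissertation's own derivation, so the notation below (margin function mr, strength s, mean correlation \bar{\rho}, generalization error PE^*) is the standard one and may differ from the author's.

```latex
mr(X, Y) = P_\Theta\bigl(h(X, \Theta) = Y\bigr) - \max_{j \neq Y} P_\Theta\bigl(h(X, \Theta) = j\bigr),
\qquad
s = E_{X,Y}\, mr(X, Y),
\qquad
PE^{*} \le \frac{\bar{\rho}\,(1 - s^{2})}{s^{2}}
```

The bound shrinks as the mean correlation falls and as the strength rises, which matches the reported negative relationship between correlation and accuracy and the positive relationship between strength and accuracy.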
Keywords/Search Tags: Random forests, unbalanced data, continuous variable discretization, node split optimization