Research On High Dimensional Imbalanced Data Classification Based On Random Forest

Posted on: 2019-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: S C Xu
GTID: 2347330569979760
Subject: Statistics
Abstract/Summary:
With the rapid growth of information globalization, high-dimensional imbalanced data has become common in real-world applications such as face recognition, spam detection, image retrieval, medical diagnosis, intrusion detection, and biological information mining. How to balance the class distribution of such data and classify its samples is an active research topic in machine learning and data mining. The random forest algorithm, first proposed by Breiman, is an ensemble learning algorithm that combines multiple decision trees. It has attracted wide attention for its good classification performance. Compared with other classification algorithms, its main advantages are high classification accuracy, small generalization error, fast training, and easy parallelization. However, when the original random forest is used to classify high-dimensional imbalanced data, its classification performance degrades and the algorithm becomes overly complex. To address these problems, this thesis optimizes and improves the random forest algorithm from two directions: the data and the algorithm itself. The main work includes the following:
(1) On the data side, an optimized variant of SMOTE, the E-SMOTE algorithm, is proposed to improve class balance. It mitigates the blurred class boundaries that SMOTE can produce, and the balanced data reduces the impact of class imbalance on the model.
(2) A feature selection method is proposed. Working on the undersampled data, it screens features by their importance and relevance, removes redundant features, and finally generates the feature subspace.
(3) An optimized random forest classification model, the weighted random forest model, is proposed. It weights and recombines the decision trees so that noise and boundary data do not interfere with the random forest and reduce its classification accuracy, thereby optimizing the model.
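The data-balancing and tree-weighting ideas in the abstract can be illustrated with a minimal sketch. The abstract does not describe how E-SMOTE refines SMOTE or how the tree weights are computed, so the code below shows only plain SMOTE interpolation and a generic weighted majority vote. The function names `smote_oversample` and `weighted_vote`, the parameter `k`, and the example weights are assumptions made for illustration and are not taken from the thesis.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by plain SMOTE interpolation.

    minority: list of equal-length numeric feature lists (minority class only).
    This is standard SMOTE, not the thesis's E-SMOTE, whose refinements
    are not described in the abstract.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbor.
        u = rng.random()
        synthetic.append([b + u * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


def weighted_vote(tree_predictions, tree_weights):
    """Combine per-tree class predictions using one weight per tree.

    How the weights are chosen is an assumption left to the caller; the
    abstract does not specify the thesis's weighting scheme.
    """
    totals = {}
    for label, weight in zip(tree_predictions, tree_weights):
        totals[label] = totals.get(label, 0.0) + weight
    # The class with the largest accumulated weight wins.
    return max(totals, key=totals.get)
```

For example, `weighted_vote(["a", "b", "b"], [0.9, 0.3, 0.3])` returns `"a"`: the single heavily weighted tree outvotes the two lightly weighted ones, whereas an unweighted majority vote would return `"b"`. Because each synthetic SMOTE point lies on a segment between two existing minority samples, points generated near the class boundary can overlap the majority class, which is the boundary-blurring problem the abstract says E-SMOTE is meant to reduce.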
Keywords/Search Tags: High-dimensional imbalanced data, Random forest, Feature selection, Model optimization