Research On High Dimensional Imbalanced Data Classification Based On Random Forest

Posted on:2019-01-22

Degree:Master

Type:Thesis

Country:China

Candidate:S C Xu

Full Text:PDF

GTID:2347330569979760

Subject:Statistics

Abstract/Summary:

With the rapid growth of information globalization, high-dimensional unbalanced data has become widespread in real-life applications such as face recognition, spam detection, image retrieval, medical diagnosis, intrusion detection and biological information mining. How to balance the sample classes of high-dimensional unbalanced data and classify the samples is an active research direction in machine learning and data mining. The random forest algorithm, first proposed by Breiman, is an ensemble learning algorithm that combines multiple decision trees. It has attracted wide attention for its good classification performance: compared with other classification algorithms, it offers high classification accuracy, small generalization error, fast training and easy parallelization. However, when the original random forest is used to classify high-dimensional unbalanced data, its classification performance degrades and the algorithm becomes overly complex. To address these problems, this paper optimizes and improves the random forest algorithm from two major aspects, the original data and the algorithm itself. The main work includes the following:

(1) On the data side, an optimized variant of SMOTE, the E-SMOTE algorithm, is proposed to improve data balance. It mitigates the blurred class boundaries that the SMOTE algorithm can produce, and the rebalanced data reduces the impact of class imbalance on the model.
(2) A feature selection method based on under-sampled data is proposed. It screens features by their importance and correlation, removes redundant features, and finally generates the feature subspace.
(3) An optimized random forest classification model, the weighted random forest model, is proposed. It weights and recombines the decision trees so that noisy and boundary samples interfere less with the classification accuracy of the random forest, thereby optimizing the model.
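
The abstract does not describe how E-SMOTE modifies SMOTE, so the sketch below illustrates only the classic SMOTE interpolation step that item (1) builds on: a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote` and the parameters `n_new`, `k` and `seed` are hypothetical and chosen for illustration; this is not the thesis's implementation.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Classic SMOTE sketch (not E-SMOTE): create n_new synthetic minority points."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by Euclidean distance to x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        # Choose one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate: x + u * (neighbour - x), with u drawn from [0, 1).
        u = rng.random()
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Usage with a small, made-up minority class in two dimensions.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies between two existing minority samples, points generated near the class boundary can overlap the majority class, which is the boundary-blurring problem the abstract says E-SMOTE is meant to reduce.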

Keywords/Search Tags:

High-dimensional unbalanced data, Random forest, Feature selection, Model optimization

Related items

1	High-dimensional Data Based On MIC Feature Selection And Application Research
2	Variable Screening Of Regression Models With Missing Data At Random
3	Research On The Evaluation Method Of Sports Effects Based On Feature Selection
4	Research On High Dimensional Imbalanced Data Classification In The Identification Of Risk User
5	Statistical Classification Analysis For High-dimensional Data
6	Joint Modeling Of Longitudinal And Survival Data With A Random Forest Based Association Structure
7	Research On Classification Of Imbalanced Datasets Based On Random Forest
8	Research On The Influencing Factors Of China’s Population Growth Based On High-dimensional Variable Selection
9	Construction Of Customer Signature And Prediction For The Loss Of Bank’s Personal Loan Customers
10	The Method Of Selecting Local Feature Words And Its Application In Text Classification