Research On High Dimensional Imbalanced Data Classification In The Identification Of Risk User

Posted on:2020-07-07

Degree:Master

Type:Thesis

Country:China

Candidate:Y Liu

Full Text:PDF

GTID:2417330578973087

Subject:Applied Statistics

Abstract/Summary:

With the development of big data, high dimensionality and class imbalance have become the norm. Classical classification algorithms handle such high-dimensional, imbalanced data poorly, chiefly because their predictions are biased toward the majority class. In practical classification problems, however, the minority class is usually the focus of attention. How to improve the classification of minority-class samples in high-dimensional imbalanced data has therefore become an active research topic. Current research addresses high-dimensional imbalanced data at three levels: data, features, and algorithms. This thesis first reviews the state of research and the relevant theoretical background at these three levels. It then analyzes the task of identifying risk users from mobile network communication behavior in the JData Competition and finds that, because traditional data-balancing methods rely on distance measures, they are largely ineffective on high-dimensional imbalanced data. High dimensionality also introduces many irrelevant and redundant features, which degrade the performance of classical classification models. To address these problems, the thesis proposes a composite LightGBM model with two-stage feature selection based on the Filter and Embedded approaches. At the feature level, to counter the low precision of the Filter approach and the high computational cost of the Embedded approach, a two-stage Filter-Embedded feature selection method is proposed: the mRMR algorithm first removes some redundant and irrelevant features, and a tree-model-based feature selection method then removes features of low or zero importance. After feature selection, ten balanced subsets are constructed from the high-dimensional imbalanced data following the EasyEnsemble idea. A LightGBM model is fitted to each subset, and the resulting models are combined by bagging into a single classifier. A comprehensive evaluation of different classifiers using tp, AUC, F1_Score, and other metrics shows that the composite LightGBM model performs best on high-dimensional imbalanced data. The model improves classification performance on the minority class, and its decision threshold can be adjusted to output class labels according to practical needs.
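
The balancing and bagging steps described in the abstract can be illustrated with a small sketch. It is not the thesis's implementation: the function names (`balanced_subsets`, `fit_ensemble`, `predict`), the toy two-feature data, and the `CentroidScorer` base learner are all assumptions made for illustration, and the centroid scorer is only a stand-in for the LightGBM model the thesis uses. The feature selection stage (mRMR followed by tree-based importance filtering) is omitted. Each balanced subset keeps every minority sample and draws an equal number of majority samples at random, one base learner is fitted per subset, and the scores are averaged before a threshold is applied.

```python
import math
import random
from statistics import mean


def balanced_subsets(X, y, n_subsets=10, seed=0):
    """Build balanced subsets: all minority rows plus an equally sized
    random draw (without replacement) of majority rows."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    subsets = []
    for _ in range(n_subsets):
        idx = minority + rng.sample(majority, len(minority))
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets


class CentroidScorer:
    """Hypothetical stand-in base learner (NOT LightGBM): scores a point
    by its relative distance to the two class centroids."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [mean(col) for col in zip(*rows)]
        return self

    def score(self, x):
        d0 = math.dist(x, self.centroids[0])
        d1 = math.dist(x, self.centroids[1])
        # Closer to the minority centroid gives a score closer to 1.
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)


def fit_ensemble(X, y, n_subsets=10, seed=0):
    """Fit one base learner per balanced subset."""
    return [CentroidScorer().fit(Xs, ys)
            for Xs, ys in balanced_subsets(X, y, n_subsets, seed)]


def predict(models, x, threshold=0.5):
    """Average the per-model scores, then apply an adjustable threshold."""
    score = mean(m.score(x) for m in models)
    return score, int(score >= threshold)


# Toy imbalanced data (assumed for illustration): 200 majority, 10 minority.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(10)]
y = [0] * 200 + [1] * 10

models = fit_ensemble(X, y)
print(predict(models, [3.0, 3.0]))                 # point near the minority cluster
print(predict(models, [0.0, 0.0]))                 # point near the majority cluster
print(predict(models, [0.0, 0.0], threshold=0.0))  # lowering the threshold flips the label
```

Lowering `threshold` trades precision for recall on the minority class, which is the kind of adjustment the abstract refers to when it says labels can be output according to practical needs.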

Keywords/Search Tags:

Data Classification, High-Dimensional Imbalanced Data, Composite LightGBM Model, Data Equalization, Feature Selection

Related items

1	Research On High Dimensional Imbalanced Data Classification Based On Random Forest
2	Statistical Classification Analysis For High-dimensional Data
3	Feature Selection Based On Rough Set For Binary-class Imbalanced Data
4	High-dimensional Data Based On MIC Feature Selection And Application Research
5	Feature Selection Of Unbalanced Data
6	A Study On Classification Of Imbalanced Data And Evaluation Metrics
7	A Study Of Model Shape Analysis And Training In Universities Of Henan Province Based On Body Examination Data
8	Research On Application Of Data Mining Model In Smart Education
9	Feature Screening For Ultrahigh-Dimensional Survival Data And Outlier Detection
10	Research On Students' Self-control Analysis Algorithm Based On Campus Big Data