
Research on High-Dimensional Imbalanced Data Classification in the Identification of Risk Users

Posted on: 2020-07-07 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Liu | Full Text: PDF
GTID: 2417330578973087 | Subject: Applied Statistics
Abstract/Summary:
With the development of big data, high dimensionality and class imbalance have become the norm. Classical classification algorithms handle such data poorly, mainly because their predictions are biased toward the majority class. In practical classification problems, however, the minority class is usually the focus of attention. Improving classification performance on the minority class in high-dimensional imbalanced data has therefore become an active research topic. Current research addresses high-dimensional imbalanced data at three levels: data, feature, and algorithm.

This thesis first introduces the research status and theoretical background at these three levels. It then studies risk user identification based on mobile network communication behavior, using data from the JData Competition. The analysis finds that the distance measures introduced by traditional data-balancing methods make those methods ineffective in the classification of high-dimensional imbalanced data. High dimensionality also brings many irrelevant and redundant features, which degrade the performance of classical classification models.

To address these problems, this thesis proposes a composite LightGBM model with two-stage feature selection based on the Filter-Embedded approach. At the feature level, to address the low precision of Filter methods and the high computational complexity of Embedded methods, a two-stage Filter-Embedded feature selection method is proposed: the mRMR algorithm first removes some redundant and irrelevant features, and a tree-model-based feature selection method then removes low-importance and zero-importance features. After feature selection, ten balanced subsets of the high-dimensional imbalanced data are constructed following the EasyEnsemble idea. A LightGBM model is fitted to each subset and used for prediction, and the per-subset models are then combined by bagging into a single composite classifier.

A comprehensive evaluation of different classifiers using TP, AUC, F1 score, and other evaluation metrics shows that the composite LightGBM model performs best on the classification of high-dimensional imbalanced data. The model improves classification performance on the minority class, and its decision threshold can be adjusted to output class labels according to actual needs.
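
The balanced-subset and bagging steps described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the thesis's implementation: every function name here is hypothetical, the minority class is assumed to be labeled 1, and a simple mean-difference scorer stands in for LightGBM so that the example stays self-contained. In practice the `fit_base` argument would be replaced by a function that trains a gradient-boosted model on each subset.

```python
import math
import random
from statistics import mean


def balanced_subsets(X, y, n_subsets=10, seed=0):
    """EasyEnsemble-style sampling: each subset keeps every minority row
    (label 1, an assumption) plus an equal-sized random draw, without
    replacement, from the majority rows (label 0)."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    subsets = []
    for _ in range(n_subsets):
        idx = minority + rng.sample(majority, len(minority))
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets


def fit_stand_in(X, y):
    """Hypothetical stand-in for the base learner (NOT LightGBM): scores a row
    by projecting it onto the difference of the two class means."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    n_features = len(X[0])
    mu_pos = [mean(x[j] for x in pos) for j in range(n_features)]
    mu_neg = [mean(x[j] for x in neg) for j in range(n_features)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    mid = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]

    def predict_proba(x):
        z = sum(wj * (xj - mj) for wj, xj, mj in zip(w, x, mid))
        z = max(-50.0, min(50.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    return predict_proba


def fit_ensemble(X, y, n_subsets=10, seed=0, fit_base=fit_stand_in):
    """Train one base model per balanced subset."""
    return [fit_base(Xs, ys) for Xs, ys in balanced_subsets(X, y, n_subsets, seed)]


def ensemble_proba(models, x):
    """Bagging-style combination: average the per-subset scores."""
    return mean(model(x) for model in models)


def classify(models, x, threshold=0.5):
    """Turn the averaged score into a label using an adjustable threshold."""
    return int(ensemble_proba(models, x) >= threshold)


if __name__ == "__main__":
    rng = random.Random(1)
    X = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(20)]
    X += [[3 + rng.gauss(0, 0.5), 3 + rng.gauss(0, 0.5)] for _ in range(4)]
    y = [0] * 20 + [1] * 4
    models = fit_ensemble(X, y)
    print(classify(models, [2.5, 2.5]), classify(models, [2.5, 2.5], threshold=0.99))
```

Lowering `threshold` trades precision for recall on the minority class, which is one way to read the abstract's remark that labels can be output according to actual needs.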
Keywords/Search Tags: Data Classification, High-Dimensional Imbalanced Data, Composite LightGBM Model, Data Equalization, Feature Selection