Application Research On Ensemble Learning Of Unbalanced Classification

Posted on:2018-11-25

Degree:Master

Type:Thesis

Country:China

Candidate:W Cong

Full Text:PDF

GTID:2359330518997514

Subject:Management Science and Engineering

Abstract/Summary:

Datasets with skewed class distributions are common in the real world. In many fields, correctly classifying minority-class samples matters more than correctly classifying majority-class samples. Most classical classification algorithms, however, assume that the training set has a balanced class distribution or that all samples carry the same misclassification cost. When a model is built on a skewed dataset, this assumption fails and the performance of these algorithms degrades: information from minority-class samples tends to be overshadowed by information from majority-class samples, so the minority class suffers a higher classification error rate than the majority class. Research on unbalanced classification is therefore drawing increasing attention and has become a prominent and difficult topic in data mining. Before discussing applications of unbalanced classification, this paper first describes the problem and the current state of research on it, and reviews sampling methods and classification algorithms. Then, given the advantage of ensemble learning algorithms over single classifiers in handling unbalanced data, it summarizes the current state of ensemble learning applied to unbalanced classification and describes the relevant ensemble learning algorithms in detail.
This paper presents two applications of unbalanced classification based on ensemble learning models. In the first part, using the unbalanced 2014 financial data of A-share companies listed on the Shanghai Stock Exchange, the Hellinger Distance Based Random Forest (HDRF) model is used to study financial early-warning models for listed companies from the perspective of ST share classification prediction. The random forest algorithm based on Hellinger distance combines the diversity of random forest with the skew insensitivity of the Hellinger distance decision tree (HDDT). In the experiment, random forest, Bagging, AdaBoost, and rotation forest ensemble classifiers based on the C4.5 decision tree are compared with the corresponding ensemble classifiers based on the Hellinger distance decision tree. The experimental results show that, measured by area under the ROC curve (AUC) and F-measure, the random forest algorithm based on Hellinger distance achieves relatively better overall classification performance in the unbalanced classification of ST shares of listed companies, and that using the HDDT as the base classifier improves the unbalanced classification performance of the ensemble model. In the second part, the application of the unbalanced classification model is extended.
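The skew insensitivity attributed to the Hellinger distance in the first part can be illustrated with a small sketch. It assumes the commonly used two-class Hellinger split criterion, which compares the fraction of positives and the fraction of negatives sent to each branch; the function name `hellinger_split_score` and the toy data are hypothetical and are not taken from the thesis.

```python
import math

def hellinger_split_score(labels, goes_left):
    """Hellinger distance between the two class-conditional branch distributions.

    labels: list of 0/1 class labels (1 = minority class, by assumption).
    goes_left: list of booleans, True if the sample falls in the left branch.
    """
    pos_total = sum(1 for y in labels if y == 1)
    neg_total = len(labels) - pos_total
    if pos_total == 0 or neg_total == 0:
        return 0.0  # only one class present, so no split can separate anything
    score = 0.0
    for branch in (True, False):
        pos_in = sum(1 for y, g in zip(labels, goes_left) if y == 1 and g == branch)
        neg_in = sum(1 for y, g in zip(labels, goes_left) if y == 0 and g == branch)
        # Each class is normalized by its own size, so class priors cancel out.
        score += (math.sqrt(pos_in / pos_total) - math.sqrt(neg_in / neg_total)) ** 2
    return math.sqrt(score)

# Same split proportions, but ten times as many majority samples in the second set.
labels_a = [1, 1, 0, 0, 0, 0]
left_a = [True, False, True, False, False, False]
labels_b = [1, 1] + [0] * 40
left_b = [True, False] + [True] * 10 + [False] * 30
score_a = hellinger_split_score(labels_a, left_a)
score_b = hellinger_split_score(labels_b, left_b)
```

Because each class is normalized by its own count, `score_a` and `score_b` agree even though the class ratio differs by a factor of ten. This is the property that makes the criterion attractive as a base-tree splitting rule on skewed data.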
Addressing customer retention in customer relationship management, this part focuses on predicting customer churn in commercial banks. CVParameterSelection is applied to optimize the parameters of the combined kernel function of a support vector machine, and a Relief-SVM customer churn prediction model based on EasyEnsemble is established. Applied to commercial bank customer data, the proposed model achieves higher AUC and F-measure than the EasyEnsemble-based Relief-SVM classification model with a single kernel function and the traditional Bagging and AdaBoost ensemble classifiers based on the C4.5 decision tree. The EasyEnsemble-based Relief-SVM customer churn prediction model with combined kernel function parameter optimization therefore proves to be an effective way to handle customer churn classification and prediction in commercial banks. It not only predicts potential churn customers more accurately but also takes the overall classification accuracy of customers into account, which makes it possible to develop retention decisions for customers likely to churn and ultimately to retain as many customers as possible. Finally, this paper summarizes the two application cases of unbalanced classification methods based on ensemble learning, analyzes their deficiencies, and outlines future research, in the hope of supporting effective knowledge discovery from unbalanced data in economics and management.
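The EasyEnsemble idea underlying the second part can be sketched as follows: draw several random majority-class subsets, each the size of the minority class, train one base model per balanced subset, and average the model outputs. This is a minimal illustration under stated assumptions, not the thesis's implementation: the `train` callback is a hypothetical stand-in for the base learner (Relief-SVM in the thesis), and the toy threshold learner and data below are invented for demonstration.

```python
import random

def easy_ensemble(majority, minority, train, n_subsets=10, seed=0):
    """Train one model per balanced subset and return an averaging scorer.

    train(majority_subset, minority) must return a function x -> score.
    Assumes len(majority) >= len(minority); otherwise rng.sample raises ValueError.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_subsets):
        # Undersample the majority class to the size of the minority class.
        sampled = rng.sample(majority, len(minority))
        models.append(train(sampled, minority))

    def predict_score(x):
        # Combine the base models by averaging their scores.
        return sum(model(x) for model in models) / len(models)

    return predict_score

def train_threshold(majority_subset, minority):
    """Hypothetical 1-D base learner: threshold halfway between the class means."""
    mid = (sum(majority_subset) / len(majority_subset) + sum(minority) / len(minority)) / 2
    return lambda x: 1.0 if x > mid else 0.0

majority = [float(i % 5) for i in range(50)]  # 50 majority samples with values 0..4
minority = [10.0, 11.0, 12.0]                 # 3 minority samples
scorer = easy_ensemble(majority, minority, train_threshold, n_subsets=5)
```

Here `scorer(11.0)` returns 1.0 and `scorer(0.5)` returns 0.0. Every base model sees all minority samples but only a small slice of the majority class, so no single model is dominated by majority-class information.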

Keywords/Search Tags:

unbalanced classification, ensemble learning, Hellinger distance decision tree (HDDT), Hellinger Distance based Random Forest (HDRandom Forest, HDRF), EasyEnsemble Relief-SVM model, combined kernel function

Related items

1	Research On Classification Of Unbalanced Financial Data Based On Ensemble Learning
2	The Application Of Minimum Hellinger Distance Method In Parametric Estimation Of Diffusion Processes
3	Knowledge Discovery Of Vehicle Credit Data Based On Decision Tree Ensemble Learning
4	Classification Research Of Resold House Data Based On Ensemble Learning
5	Research On Personal Network Loan Behavior Based On Decision Tree And Random Forest
6	Research On Prediction Of Personal Credit Default By Improved Random Forest Model
7	Research On Assessment Model Of Financing Structural Ability Of The Small And Medium-sized Enterprises
8	Application Of Supervised Machine Learning On Poverty Identification Of Rural In Gansu Province
9	The Bayesian Classification Model Based On Feature Selection Using Random Forest And Its Application
10	Credit Card Default Prediction Based On Weighted Stacking Ensemble Learning