
Research On Classification Algorithm For Imbalanced Data

Posted on: 2021-11-26    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J Zhang    Full Text: PDF
GTID: 1488306455963719    Subject: Computer software and theory
Abstract/Summary:
Classification is one of the core tasks of big data analysis, with typical applications in the national economy and people's livelihood such as customer behavior mining, medical diagnosis, and disaster warning, and it remains one of the most active research directions in classification. However, data collected in real environments are not manually adjusted, so the classes often exhibit an imbalanced distribution in quantity: some classes, such as bank bad debts, cancer cases, or disaster records, contain only a small number of samples, which has a negative impact on the classification model. Most feature selection and sampling algorithms aim to maximize overall classification accuracy, which benefits the majority class but restricts the recognition of the minority class. It is therefore urgent to address the imbalanced classification problem. This dissertation conducts an in-depth study of techniques for imbalanced data learning, mainly covering feature selection, undersampling, and the parameter settings of these algorithms, which require practical guidance and optimization. The main research contributions of this dissertation are as follows:

(1) To address low efficiency, susceptibility to local optima, and difficult parameter setting, a feature selection technique based on adaptive grid search and improved Laplacian Eigenmaps is proposed. The main idea is to compute the Laplacian score of each feature, design a search strategy over feature subsets to find the optimal subset, and determine the parameters by a grid search algorithm. Experimental results demonstrate that the proposed technique preserves the accuracy advantages essential to imbalanced data learning while, to a certain extent, alleviating the tendency to fall into local optima and the high computational cost. Compared with algorithms from the literature, it also shows a competitive advantage.

(2) To address the "underfitting" issue of random undersampling, an undersampling algorithm based on distance-threshold clustering is proposed. Starting from discriminant samples, boundary data within the clusters are selected via clustering to reconstruct the training subset; the number of majority-class samples is controlled by a distance threshold so that informative samples are retained, yielding a balanced training subset. Simulation experiments and statistical analysis show that the proposed method outperforms comparison algorithms such as SMOTEBoost in terms of accuracy and MCC.

(3) To address the dependence of parameter settings on empirical values and the high computational cost of the distance-threshold clustering method above, an undersampling algorithm based on hybrid sampling and distance-constrained clustering is proposed. Hybrid sampling is used to balance the class distribution, the clustering-based method is used to select border samples, and distance constraints are used to control the number of samples in each class, aiming to improve classification performance through parameter optimization. Simulation experiments and statistical analysis demonstrate the effectiveness of the proposed method.
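The abstract does not include pseudocode; as a rough illustration of the clustering-based undersampling idea in (2) and (3), the sketch below balances a dataset by clustering the majority class and keeping only border samples selected by a distance rule. It assumes scikit-learn-style data, uses k-means as the clustering step, and the per-cluster cut-off (`threshold` times the cluster's median distance to the minority class) is my own assumption, not the dissertation's actual rule.

```python
# Illustrative sketch of distance-threshold clustering undersampling
# (an interpretation of contributions (2)/(3), not the thesis code).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def distance_threshold_undersample(X, y, majority_label, n_clusters=10, threshold=0.5):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    maj_mask = (y == majority_label)
    X_maj, X_min = X[maj_mask], X[~maj_mask]

    # Step 1: cluster the majority class to expose its internal structure.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_maj)

    # Step 2: distance from each majority sample to its nearest minority sample;
    # samples close to the minority class are treated as "border" samples.
    _, dist_to_min = pairwise_distances_argmin_min(X_maj, X_min)

    keep = np.zeros(len(X_maj), dtype=bool)
    for c in range(n_clusters):
        in_c = np.where(km.labels_ == c)[0]
        # Step 3: per-cluster distance cut-off controls how many majority
        # samples survive (assumed rule: a fraction of the median distance).
        cutoff = threshold * np.median(dist_to_min[in_c])
        keep[in_c[dist_to_min[in_c] <= cutoff]] = True

    X_bal = np.vstack([X_maj[keep], X_min])
    y_bal = np.concatenate([np.full(keep.sum(), majority_label), y[~maj_mask]])
    return X_bal, y_bal
```

Raising `threshold` keeps more majority samples; lowering it trims the training subset toward the class boundary.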
(4) To further ease the parameter setting of the undersampling algorithm based on hybrid sampling and distance-constrained clustering, an undersampling algorithm based on affinity propagation is proposed. The majority-class samples are clustered adaptively by a secondary clustering strategy: the first clustering step determines the number of clusters in the majority class, and the second clustering step selects the border samples. Extensive simulation experiments and statistical tests show that the overall performance of the algorithm surpasses the benchmark algorithms in terms of overall accuracy.
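For contribution (4), a minimal sketch of the secondary-clustering idea follows: affinity propagation first fixes the number of majority clusters adaptively, then a second clustering pass picks border samples from each cluster. The second-stage algorithm (k-means) and the border rule (nearest-to-minority per cluster, capped by a hypothetical `per_cluster` parameter) are assumptions for illustration only.

```python
# Illustrative sketch of affinity-propagation-based secondary clustering
# undersampling (my reading of contribution (4), not the thesis code).
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def ap_secondary_undersample(X, y, majority_label, per_cluster=5):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    maj = np.where(y == majority_label)[0]
    X_maj, X_min = X[maj], X[y != majority_label]

    # First clustering: affinity propagation chooses the cluster count itself,
    # removing the need to preset the number of majority clusters.
    ap = AffinityPropagation(random_state=0).fit(X_maj)
    k = max(len(ap.cluster_centers_indices_), 1)

    # Second clustering: re-cluster the majority class with that cluster count.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_maj)

    # Border selection: from each cluster keep the samples nearest the minority class.
    _, dist_to_min = pairwise_distances_argmin_min(X_maj, X_min)
    kept = []
    for c in range(k):
        in_c = np.where(km.labels_ == c)[0]
        kept.extend(in_c[np.argsort(dist_to_min[in_c])[:per_cluster]])

    idx = np.concatenate([maj[kept], np.where(y != majority_label)[0]])
    return X[idx], y[idx]
```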
Keywords/Search Tags:Class-imbalance learning, Adaptive grid search, Constrained clustering, Affinity propagation