Research And Application Of Classification Technology For Unbalanced Data

Posted on: 2022-10-10  Degree: Master  Type: Thesis
Country: China  Candidate: P Zhang  Full Text: PDF
GTID: 2518306317993979  Subject: Computer application technology
Abstract/Summary:
Imbalanced data classification arises in many fields, such as industrial production, finance, and information security, and has been an active research topic in recent years. In imbalanced data, some classes account for only a small proportion of the samples. When traditional data mining classification algorithms are applied to such data sets, the minority-class samples are easily misclassified, yet identifying these classes is usually the user's main concern and has higher research value. Characteristics of imbalanced data sets, such as skewed class labels, overlapping classes, and too few minority-class samples, are the main sources of classification difficulty. At present, methods for imbalanced data classification fall mainly into sampling methods at the data level and classifier improvement methods at the algorithm level. Existing sampling methods have known problems: under-sampling tends to lose useful information, while over-sampling tends to cause overfitting.
Classifier improvement methods each have their own merits and limitations. This thesis carries out two studies on class imbalance in classification and applies the results to a medical diagnosis problem.

At the data level, this thesis proposes TLS (Tomek Links-SMOTE), a hybrid sampling algorithm for handling imbalance in labeled data sets. All sample points in the data set that belong to a "Tomek Links pair" are identified and removed, and the remaining minority-class samples are oversampled with the SMOTE algorithm. The majority-class samples and the new minority-class samples are then combined to form a new training data set. Experiments on UCI data sets show that the TLS sampling algorithm recognizes minority-class samples better than traditional sampling algorithms for imbalanced data.

At the algorithm level, this thesis changes the decision rule of random forest classification. The traditional random forest algorithm usually uses majority voting: the final output class is the mode of the classes output by the individual decision trees. This rule seems fair, but the decision trees in a random forest differ in classification ability, so some produce better results than others. Giving every tree the same voting weight often leads to errors in the final result. The effect is particularly pronounced on imbalanced data sets, where the classification results are often biased toward the majority class. In this thesis, the rule "the output class is the mode of the classes output by the decision trees" is replaced by a rule that depends on the ratio of the number of trees predicting the minority class to the number of trees predicting the majority class. Extensive experiments on selected data sets indicate that this improves the recognition of minority-class samples. The evaluation metrics for each data set are visualized so that the strengths and weaknesses of each model can be compared more intuitively, which supports the effectiveness of the improved random forest method for imbalanced data classification.

Finally, this work is applied to diabetes diagnosis. Three diabetes diagnosis models are built: TLS (Tomek Links-SMOTE), the improved random forest, and TLS combined with the improved random forest. The suitability of each model is analyzed, with the aim of helping doctors diagnose diabetes.
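
The TLS procedure described in the abstract (remove Tomek link points, oversample the remaining minority samples with SMOTE, then recombine) might be sketched as follows. This is a minimal illustration in plain Python, not the thesis implementation. The function names, the binary-label setting, the choice to remove both members of each Tomek pair, the neighbour count `k`, and the target of oversampling until the classes are equal in size are all assumptions, since the abstract does not specify them.

```python
import math
import random

def nearest_index(i, X):
    """Index of the nearest neighbour of X[i] (Euclidean distance), excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_link_indices(X, y):
    """Indices of all points in a Tomek link: mutual nearest neighbours with different labels."""
    nn = [nearest_index(i, X) for i in range(len(X))]
    return {i for i in range(len(X)) if nn[nn[i]] == i and y[i] != y[nn[i]]}

def smote(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours."""
    if n_new > 0 and len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

def tls_resample(X, y, minority_label, k=3, seed=0):
    """Hypothetical helper: drop Tomek link points, then SMOTE the remaining minority samples."""
    rng = random.Random(seed)
    links = tomek_link_indices(X, y)
    kept = [(x, lab) for i, (x, lab) in enumerate(zip(X, y)) if i not in links]
    majority = [(x, lab) for x, lab in kept if lab != minority_label]
    minority = [x for x, lab in kept if lab == minority_label]
    # Assumption: oversample until both classes have the same size.
    n_new = max(0, len(majority) - len(minority))
    new_minority = minority + smote(minority, n_new, k, rng)
    X_new = [x for x, _ in majority] + new_minority
    y_new = [lab for _, lab in majority] + [minority_label] * len(new_minority)
    return X_new, y_new

# Toy data: (3.0, 3.0) and (3.2, 3.0) are mutual nearest neighbours with different labels.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (2, 1), (3.0, 3.0),
     (5, 5), (5, 6), (6, 5), (3.2, 3.0)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
X_new, y_new = tls_resample(X, y, minority_label=1)
print(len(X_new), y_new.count(0), y_new.count(1))
```

On this toy data the borderline pair is removed first, so the synthetic minority points are interpolated only between the three minority samples that are well separated from the majority class.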
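
The difference between the traditional mode-based vote and a ratio-based decision rule of the kind the abstract describes can be illustrated on a list of per-tree predictions. This is a sketch under stated assumptions: the abstract does not give the exact rule or any cut-off value, so the function names and the `threshold` parameter here are hypothetical.

```python
from collections import Counter

def majority_vote(tree_votes):
    """Traditional rule: the mode of the per-tree predictions."""
    if not tree_votes:
        raise ValueError("no votes given")
    return Counter(tree_votes).most_common(1)[0][0]

def ratio_vote(tree_votes, minority_label, majority_label, threshold=0.5):
    """Ratio rule (assumed form): predict the minority class when
    (#trees voting minority) / (#trees voting majority) >= threshold.
    The threshold value is a hypothetical parameter, not taken from the thesis."""
    if not tree_votes:
        raise ValueError("no votes given")
    n_min = sum(1 for v in tree_votes if v == minority_label)
    n_maj = sum(1 for v in tree_votes if v == majority_label)
    if n_maj == 0:
        return minority_label  # every tree voted for the minority class
    return minority_label if n_min / n_maj >= threshold else majority_label

# Ten trees: six predict the majority class (0), four predict the minority class (1).
votes = [0] * 6 + [1] * 4
print(majority_vote(votes))                                   # 0
print(ratio_vote(votes, minority_label=1, majority_label=0))  # 1, since 4/6 >= 0.5
```

With a threshold below 1, a sample can be assigned to the minority class even when fewer than half of the trees vote for it, which counteracts the bias toward the majority class at the cost of more false positives. The threshold would need to be tuned on validation data.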
Keywords/Search Tags: Imbalanced data, Tomek Links, SMOTE, Random forest, Classification