Research On Label Noise Learning Algorithms Under Complex Classification Scenarios

Posted on: 2023-04-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Y Chen
GTID: 1528307304491984
Subject: Computer Science and Technology
Abstract/Summary:
As society moves into the era of the digital economy, data is becoming increasingly prominent as a factor of production. Refining valuable information from massive amounts of data has become a goal and direction of comprehensively deepening reform. However, manual annotation can hardly keep pace with the growth of data, so the label noise that accompanies massive data hinders the accurate refinement of information, especially in supervised classification tasks. Mitigating the negative influence of label noise and enhancing the robustness of classification algorithms have therefore become a hot topic deserving in-depth exploration. Existing label noise learning methods have three shortcomings: they suffer from over-filtering caused by asymmetric label noise in binary classification, they lack a general label noise learning framework for multiclass classification, and they lack an oversampling method that is robust to label noise in imbalanced classification. These gaps make it difficult to cope with label noise in complex classification scenarios. Building on existing label noise learning methods, this thesis investigates the characteristics of label noise in asymmetrically distributed binary, multiclass, and imbalanced data sets. It proposes, respectively, a self-adaptive label noise learning method for binary classification, a general label noise learning framework for multiclass classification, and a robust label noise learning method for imbalanced classification, in order to improve the effectiveness and generalizability of label noise learning methods across classification scenarios. The main contributions of this thesis are as follows:

(1) For asymmetric label noise in binary data sets, a self-adaptive label noise learning algorithm is proposed. It addresses the over-filtering of training samples caused by threshold failure under asymmetric noise and alleviates the resulting under-fitting of classification models. A softer hypothesis replaces the hard hypothesis in the original relative density, so that label-noise samples and non-noisy samples can be clustered adaptively into two clusters based on the variability of the samples' local distribution characteristics. This turns the global threshold variable into a class-dependent local variable and removes the method's dependence on threshold settings. In addition, a power function is introduced to amend the relative density values, adjusting the gaps between different relative densities and extending the adaptivity of the algorithm so that label noise is identified more accurately and comprehensively. Experiments show that the proposed algorithm achieves the best trade-off between the label noise recognition rate and the false recognition rate on real datasets, and preserves the original distribution characteristics of the data samples as far as possible at the lowest cost. It effectively improves the generalization ability of classifiers in binary classification scenarios and is particularly robust to asymmetric noise rates.

(2) For label noise in multiclass data sets, multiclass label noise is first formally defined, and a general learning framework for multiclass label noise is derived from this definition. The framework is instantiated with the complete random forest algorithm and the relative density algorithm. Furthermore, two methods for optimizing the noise intensity threshold are proposed, one from the reliability perspective and one from the efficiency perspective: a novel voting cross-validation method and an adaptive method. These further extend the generality of multiclass label noise learning methods. Experiments on both synthetic and real datasets show that the proposed framework effectively mitigates the degradation in generalizability that label noise causes in multiclass classifiers. The framework can also be combined with arbitrary imbalanced sampling algorithms in a loosely coupled manner to deal with multiclass imbalanced label noise, improving data quality and boosting the classification performance and robustness of multiclass classifiers to label noise.

(3) For label noise in imbalanced data sets, a robust label noise learning algorithm for imbalanced classification is designed to address the issues caused by asymmetric, imbalanced, and multiclass label noise. First, a robust synthetic minority oversampling algorithm is proposed that heuristically identifies label noise and automatically samples non-noisy samples within the partitioned regions. By making the oversampler more robust to label noise, it improves the generalizability of classifiers on imbalanced data sets. Then, a multiclass self-adaptive label noise learning algorithm, extended from the general multiclass label noise learning framework, is applied to filter the label noise in the oversampled multiclass data sets. Together, the two steps combat the interference of asymmetric and imbalanced label noise in multiclass data sets, strengthening the classification boundary and cleaning the noisy data in order to find the optimal classification boundary. Experiments show that the algorithms proposed in this thesis each improve both data quality and classification performance when applied individually, and that their combinations are complementary and further enhance robustness against label noise in complex classification scenarios beyond what a single algorithm achieves.
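
As a rough illustration of the relative density idea that contribution (1) builds on, the sketch below assumes one common formulation: for each sample, the mean distance to its k nearest same-label neighbours divided by the mean distance to its k nearest other-label neighbours, so that a large ratio suggests a suspicious label. The abstract does not state the thesis's exact definition, so the function names `relative_density` and `amend`, the parameters `k` and `gamma`, and the toy data are all hypothetical. The adaptive two-cluster step described above is not reproduced here.

```python
import numpy as np

def relative_density(X, y, k=3):
    """Assumed formulation: mean distance to the k nearest same-label
    neighbours divided by mean distance to the k nearest other-label
    neighbours. Larger values suggest a more suspicious label.
    Assumes every class has at least two samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; the diagonal is set to inf to exclude self.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    scores = np.empty(len(X))
    for i in range(len(X)):
        same = d[i, y == y[i]]
        same = np.sort(same[np.isfinite(same)])[:k]   # drop self, keep k nearest
        other = np.sort(d[i, y != y[i]])[:k]
        scores[i] = same.mean() / other.mean()
    return scores

def amend(scores, gamma=0.5):
    """Hypothetical power-function amendment: rescales the gaps between
    relative densities while preserving their order (gamma > 0)."""
    return np.power(scores, gamma)

# Toy data: two tight groups; the sample at index 4 sits in the first
# group but carries the second group's label (a simulated label flip).
X_demo = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1], [5.05, 5.05],
])
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

demo_scores = relative_density(X_demo, y_demo, k=3)
demo_amended = amend(demo_scores, gamma=0.5)
```

On this toy data the flipped sample receives by far the largest score, because its same-label neighbours are all in the distant group. A fixed global cut-off on such scores is the kind of threshold the thesis aims to replace with a class-dependent, adaptively clustered decision.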
Keywords/Search Tags: Classification learning, Label noise, Relative density, Complete random forest, Adaptive algorithm