Study On Label Noise In The Classification

Posted on: 2020-02-29
Degree: Master
Type: Thesis
Country: China
Candidate: Q Gao
Full Text: PDF
GTID: 2417330602452462
Subject: Statistics
Abstract/Summary:
Label noise is an important issue in classification and has many potential negative consequences, the most typical being reduced prediction accuracy. The existing literature on label noise follows two main approaches. The algorithm-level approach aims to design robust supervised learning algorithms that are little affected by label noise. The data-level approach focuses on identifying mislabeled data and then removing or correcting it. Algorithm-level methods, however, are modifications of particular traditional machine learning algorithms and therefore lack versatility. Data-level approaches have some advantages: the handling of label noise is separated from the training of classifiers, and most researchers consider that the cleaned data can be applied in a wider range of settings.

The data-level approach comprises two major methods, noise removal and noise correction. Of the two, noise correction is the better choice. On the one hand, important information may be lost when noisy data are removed directly. On the other hand, removing noisy data may be prohibitively expensive when the cost of collecting data is high. This work therefore concentrates on correcting label noise.

First, estimating the label noise rate in the data can provide useful information for label noise correction. Because most existing methods for estimating the label noise rate apply only to binary classification, we propose a new noise rate estimation method. It aims to identify potentially mislabeled instances so that it can supply more useful information to the correction process. The method consists of three steps. To start with, a kNN classifier is used with leave-one-out cross-validation to derive a probability estimate for each instance in the data set. Then thresholds are found to detect anomalous instances; the threshold for each class is the mean probability estimate of all examples in that class. Finally, the potentially mislabeled instances are counted and their percentage of all instances is computed. The algorithm handles both binary and multi-class classification.

Second, existing label noise correction algorithms usually adopt either supervised learning or unsupervised learning. The two paradigms, however, attend to different aspects of the data. Combining their characteristics can provide more useful information and thus improve the accuracy of label noise correction. This thesis therefore designs a label noise correction algorithm that combines supervised learning with unsupervised learning, based specifically on the K-means and kNN algorithms. The proposed method runs clustering on the training set one or more times. It then uses majority voting to estimate each instance's label and combines the result with the noise rate estimate to derive a confidence for each label in the data. Finally, according to this confidence and using majority voting between clusters, the labels of the training data are corrected.

To evaluate the proposed algorithm, we use three criteria: label accuracy, model quality, and AUC. Extensive experiments on real-world data sets are reported. The empirical study shows that, compared with several existing correction methods, our approach successfully corrects noisy labels and improves data quality in many cases, which in turn lets the classifier achieve higher prediction accuracy.
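
The three-step noise rate estimation described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation, only one plausible reading of the summary. The function names (`loo_knn_label_probability`, `estimate_noise_rate`), the choice of `k=3`, the Euclidean distance, the use of the neighbour vote fraction as the "probability estimate", and the strict less-than comparison against the class mean are all assumptions made here for illustration.

```python
from collections import defaultdict
from math import dist


def loo_knn_label_probability(X, y, k=3):
    """Step 1 (assumed form): for each instance, the fraction of its k nearest
    neighbours, found with the instance itself left out, that share its label."""
    probs = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Leave-one-out: rank every *other* instance by Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: dist(xi, X[j]),
        )[:k]
        agree = sum(1 for j in neighbours if y[j] == yi)
        probs.append(agree / len(neighbours))
    return probs


def estimate_noise_rate(X, y, k=3):
    """Return (estimated noise rate, indices of suspected mislabeled instances)."""
    probs = loo_knn_label_probability(X, y, k)

    # Step 2: per-class threshold = mean probability estimate of that class.
    by_class = defaultdict(list)
    for p, label in zip(probs, y):
        by_class[label].append(p)
    thresholds = {label: sum(ps) / len(ps) for label, ps in by_class.items()}

    # Step 3: count instances falling below their class threshold
    # (strict comparison is an assumption) and report their share.
    suspects = [
        i for i, (p, label) in enumerate(zip(probs, y)) if p < thresholds[label]
    ]
    return len(suspects) / len(y), suspects


# Toy data (hypothetical): two well-separated groups; the point at index 4
# lies next to the first group but carries the second group's label.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.5, 0.5),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

rate, suspects = estimate_noise_rate(X, y, k=3)
print(rate, suspects)
```

Because the thresholds are computed per class from the labels actually present, the same code applies unchanged to more than two classes, which matches the multi-class claim in the abstract. Note that a mean-based threshold flags some instances in any class whose probability estimates are not all equal, so the resulting figure is better read as a rate of potential label noise than as an exact count.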
Keywords/Search Tags: Label noise, Noise correction, Noise rate estimation, Classification