
A Trusted-item-based Interactive Method To Improve The Quality Of Labeled Data And Its Application

Posted on: 2022-08-05  Degree: Master  Type: Thesis
Country: China  Candidate: S X Xiang  Full Text: PDF
GTID: 2518306746457394  Subject: Software engineering
Abstract/Summary:
The quality of training data is crucial to the success of supervised and semi-supervised learning. However, in the current era of big data, data quality is hard to guarantee as data scale grows rapidly. Existing methods let users confirm the labels of some samples, treat these samples as trusted items, and then correct the remaining unconfirmed samples with a label correction algorithm. To select samples with higher correction gains for confirmation, users need to explore the data distribution. Two challenges remain in correcting label errors for large-scale datasets. The first is to explore large-scale data effectively and quickly identify regions with many label errors and items with higher correction gains. The second is the efficiency of existing label correction algorithms: when the numbers of samples and categories are very large, the existing algorithm takes several days and a large amount of memory to perform label correction, which is unaffordable.

To tackle the challenge of large-scale data exploration, we propose a visual analysis system, Data Debugger, which uses the "overview + detail" method to display data hierarchically, helping users explore data and correct label errors effectively and efficiently. When the hierarchical structure is constructed by sampling, labeling outliers should be preserved as much as possible while the data distribution is maintained. We therefore propose an outlier-biased sampling method that better preserves both the data distribution and the labeling outliers. When users switch from the overview level to the detail level, the data distribution should not change significantly, since stability is essential for the exploration process. We therefore propose an incremental tSNE that maintains the readability and the stability of the data distribution simultaneously. A case study demonstrates the usefulness of the system: for a clothing image dataset with 37,497 samples, 306 trusted items are confirmed, and the label accuracy improves from 61.73% to 75.02%.

To tackle the challenge of large-scale data correction, we propose a batch-processing method and an automatic label-hierarchy construction method. The batch-processing method splits large-scale data into batches, corrects each batch, and then merges the results. Reducing the number of samples in each correction process improves the efficiency of the algorithm. The automatic label-hierarchy construction method uses the semantics and size of each category to aggregate similar categories into clusters, which form the hierarchical structure. To keep the number of categories in each correction process small, we perform label correction in a top-down manner according to the established hierarchy. For a live game image dataset with over one million samples, label correction with the existing algorithm takes several days and more than five hundred gigabytes of memory. With the proposed algorithm, this is reduced to five hours and 14 gigabytes of memory. We combine the proposed algorithm with Data Debugger to correct medical CT images, which increases the AUC from 91.3% to 93.9%.
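
The split, correct, and merge idea behind the batch-processing method can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the function names `correct_in_batches` and `nearest_trusted_correction`, the nearest-trusted-item rule used as a stand-in for the label correction algorithm, and the choice to include every trusted item in every batch are all assumptions made only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
CorrectFn = Callable[[Sequence[Point], Sequence[str], Sequence[bool]], List[str]]


def nearest_trusted_correction(
    points: Sequence[Point], labels: Sequence[str], trusted: Sequence[bool]
) -> List[str]:
    """Hypothetical stand-in for a label correction algorithm:
    each untrusted sample takes the label of its nearest trusted item."""
    trusted_idx = [i for i, t in enumerate(trusted) if t]
    corrected = list(labels)
    if not trusted_idx:
        return corrected  # no trusted items to propagate from
    for i, p in enumerate(points):
        if trusted[i]:
            continue  # confirmed labels are never changed
        nearest = min(
            trusted_idx,
            key=lambda j: (p[0] - points[j][0]) ** 2 + (p[1] - points[j][1]) ** 2,
        )
        corrected[i] = labels[nearest]
    return corrected


def correct_in_batches(
    points: Sequence[Point],
    labels: Sequence[str],
    trusted: Sequence[bool],
    batch_size: int,
    correct_fn: CorrectFn = nearest_trusted_correction,
) -> List[str]:
    """Split the untrusted samples into batches, correct each batch, merge."""
    trusted_idx = [i for i, t in enumerate(trusted) if t]
    rest = [i for i, t in enumerate(trusted) if not t]
    merged = list(labels)
    for start in range(0, len(rest), batch_size):
        # Assumption: every batch sees all trusted items plus one slice
        # of the unconfirmed samples.
        batch = trusted_idx + rest[start:start + batch_size]
        out = correct_fn(
            [points[i] for i in batch],
            [labels[i] for i in batch],
            [trusted[i] for i in batch],
        )
        # Merge step: write the batch result back to the global label list.
        for i, lab in zip(batch, out):
            merged[i] = lab
    return merged


# Toy data: two clusters, one trusted item per cluster, noisy labels elsewhere.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (4.9, 5.2)]
labels = ["shirt", "dress", "shirt", "dress", "shirt", "dress"]
trusted = [True, False, False, True, False, False]

batched = correct_in_batches(points, labels, trusted, batch_size=2)
unbatched = nearest_trusted_correction(points, labels, trusted)
```

In this toy setting the batched and unbatched results coincide because the stand-in rule depends only on the trusted items. For a real correction algorithm the results may differ between the two, and each call only has to handle one batch of samples at a time.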
Keywords/Search Tags: Quality improvement of labeled data, trusted item, tSNE, sampling