A Trusted-item-based Interactive Method To Improve The Quality Of Labeled Data And Its Application

Posted on:2022-08-05

Degree:Master

Type:Thesis

Country:China

Candidate:S X Xiang

Full Text:PDF

GTID:2518306746457394

Subject:Software engineering

Abstract/Summary:

The quality of training data is crucial to the success of supervised and semi-supervised learning. However, in the current era of big data, the rapid growth of data scale makes data quality hard to guarantee. Existing methods let users confirm the labels of some samples and treat these samples as trusted items, then correct the remaining unconfirmed samples with a label correction algorithm. To select the samples with higher correction gains to confirm, users need to explore the data distribution. Two challenges remain in correcting label errors for large-scale datasets. The first is to explore large-scale data effectively and to quickly identify regions with many label errors and items with higher correction gains. The second is the efficiency of existing label correction algorithms: when the numbers of samples and categories are too large, the existing algorithm takes several days and a large amount of memory, which is unaffordable.

To tackle the challenge of large-scale data exploration, we propose a visual analysis system, Data Debugger, which uses the "overview + detail" method to display data hierarchically, helping users explore data and correct label errors effectively and efficiently. When the hierarchical structure is constructed by sampling, labeling outliers should be preserved as much as possible while the data distribution is maintained. We therefore propose an outlier-biased sampling method that better maintains both the data distribution and the labeling outliers. When users switch from the overview level to the detail level, the data distribution should not change significantly, since stability is essential to the exploration process. We therefore propose an incremental tSNE that maintains the readability and stability of the data distribution simultaneously. A case study demonstrates the usefulness of our system: for a clothing image dataset with 37,497 samples, 306 trusted items are confirmed, and the label accuracy improves from 61.73% to 75.02%.

To tackle the challenge of large-scale data correction, we propose a batch-processing method and an automatic label-hierarchy construction method. The batch-processing method splits large-scale data into batches, corrects each batch, and then merges the results. Reducing the number of samples in each correction process improves the efficiency of the algorithm. The automatic label-hierarchy construction method uses the semantics and size of each category to aggregate similar categories into clusters and thereby construct the hierarchical structure. To keep the number of categories in each correction process small, we perform label correction in a top-down manner according to the established hierarchy. For a live game image dataset with over one million samples, the existing algorithm takes several days and more than five hundred gigabytes of memory, whereas the proposed algorithm takes five hours and 14 gigabytes of memory. We combine the proposed algorithm with Data Debugger to correct medical CT images, which increases the AUC from 91.3% to 93.9%.
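
The split, correct, and merge idea behind the batch-processing method can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the corrector used here (each item takes the label of its nearest trusted item) is a hypothetical stand-in, and the names `nearest_trusted_corrector` and `correct_in_batches` are assumptions introduced only for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[float, ...]   # feature vector of one sample
Labels = Dict[int, str]      # sample index -> label


def nearest_trusted_corrector(
    features: Sequence[Sample],
    labels: Labels,
    trusted: Labels,
    batch: Sequence[int],
) -> Labels:
    """Hypothetical stand-in for a label correction algorithm.

    Each item in the batch receives the label of its nearest trusted item
    (squared Euclidean distance). Without trusted items, the original
    labels are returned unchanged.
    """
    if not trusted:
        return {i: labels[i] for i in batch}
    corrected: Labels = {}
    for i in batch:
        if i in trusted:
            corrected[i] = trusted[i]  # confirmed labels are never changed
            continue
        nearest = min(
            trusted,
            key=lambda t: sum(
                (a - b) ** 2 for a, b in zip(features[i], features[t])
            ),
        )
        corrected[i] = trusted[nearest]
    return corrected


def correct_in_batches(
    features: Sequence[Sample],
    labels: Labels,
    trusted: Labels,
    batch_size: int,
    corrector: Callable[..., Labels] = nearest_trusted_corrector,
) -> Labels:
    """Split the samples into batches, correct each batch, merge the results."""
    indices: List[int] = list(range(len(features)))
    merged: Labels = {}
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        # Each call only sees batch_size samples, which bounds its cost.
        merged.update(corrector(features, labels, trusted, batch))
    return merged


if __name__ == "__main__":
    features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.2, 0.1)]
    labels = {0: "shirt", 1: "dress", 2: "dress", 3: "shirt", 4: "shirt"}
    trusted = {0: "shirt", 2: "dress"}  # labels confirmed by the user
    print(correct_in_batches(features, labels, trusted, batch_size=2))
```

In this sketch every batch is corrected against the same set of trusted items, so the merge step is a plain dictionary update. How the thesis shares trusted items across batches and reconciles batch results is not stated in the abstract.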

Keywords/Search Tags:

Quality improving of labeled data, trusted item, tSNE, sampling

Related items

1	Continuous Quality Improvement Management System Research And Development
2	Research On Deep Web Data Source Selection Method Based On Sampling
3	New techniques for improving biological data quality through information integration
4	Using quantitative and qualitative methods to evaluate survey item quality: A demonstration of practice leading to item clarity
5	Research On Improving Quality Of Service In SIPHello Media Stack
6	Adaptive classification of scarcely labeled and evolving data streams
7	Research On Sampling Based Aggregate Query Method Of Power Quality Data
8	Descriptive and inferential attribute data analysis of aircraft structural joins as a model for improving quality on mature production programs
9	Analysis And Assessment Of Data Quality For Data Warehouse
10	A System for Stratified Sampling of Entity Resolution Results to Assess and Improve Accuracy with Minimal Clerical Review Effor