| In recent years, with the rapid development of the Internet and the arrival of the big data era, the growing volume of data and of data annotations has promoted the application of machine learning and deep learning in various fields. Data quality is an important factor affecting the application of data mining and learning algorithms. The application process of learning algorithms involves three main entities: the raw data, the data labels, and the target task. Due to many objective factors, such as the high level of expertise and subjectivity involved in labeling, massive data growth and high labeling cost, and the variability of the target task across application scenarios, pairwise mismatches among these three entities inevitably arise when learning algorithms are applied. For example, the human subjectivity of the annotation process leads to a mismatch between the raw data and the data labels, i.e., the unavoidable presence of noisy labels in the data. The low data quality caused by such mismatches poses a great challenge to the effectiveness of learning algorithms. Therefore, alleviating the low data quality caused by mismatch and improving data utilization in data mining is an important research issue. In this dissertation, we focus on the mining and application of low-quality data under mismatch, and investigate the low-quality data problems that arise from mismatches between the three entities involved in the application process, namely the raw data, the data labels, and the target task. The main research contents and contributions of this dissertation are as follows: · Mismatch between raw data and data labels: this dissertation proposes an efficient noisy label detection scheme for incremental datasets in data systems. A mismatch between raw data and data labels means that the data contains noisy labels. Existing work tends to perform noisy label detection on well-collected datasets and pays little attention to incremental data
scenarios. When raw data and data labels are mismatched, existing work has the following problems in incremental scenarios: large computational overhead due to repeated training, and difficulty in accurately detecting noisy labels in a specific incremental dataset. How to perform efficient noisy label detection for incremental datasets in data systems remains an open problem. In this dissertation, we propose an efficient and accurate noisy label detection scheme for incremental datasets in data systems, which includes a fine-grained noisy label detection method using contrast sampling. The method takes into account the relationship between label probability, output confidence, and feature representation, and requires only a small amount of fine-tuning of the general model to detect noisy labels efficiently and accurately on incremental datasets. Extensive experiments show that the proposed framework can efficiently and accurately detect noisy labels on incremental datasets under different noise rate settings. · Mismatch between raw data and target task: this dissertation proposes a scheme for semi-supervised learning algorithms under class mismatch. Because of the cost of data labeling, data systems often hold a huge amount of unlabeled data. Semi-supervised learning is a learning paradigm that trains jointly on a small amount of labeled data and a large amount of unlabeled data. However, cheap and abundant unlabeled data inevitably contains many samples that do not match the target task, and in such scenarios class mismatch significantly degrades the performance of existing semi-supervised learning algorithms. In this dissertation, we propose a scheme to mitigate the performance degradation of traditional semi-supervised learning methods under class mismatch. It consists of three main training techniques: entropy
repulsion loss, batch annealing, and data reloading, which work together to moderate the over-involvement of potentially task-irrelevant class-mismatched data in the training process. Compared with the original semi-supervised learning methods, methods equipped with the proposed framework significantly mitigate the performance degradation caused by class-mismatched samples participating in training. · Mismatch between data labels and target task: this dissertation proposes a key activity detection scheme based on coarse-grained labels. In real-world scenarios, where target tasks and requirements vary, it is often impractical to repeatedly collect and label datasets. Mining existing data deeply and effectively and improving the utilization of available data are therefore key issues. In real data systems, the available labels related to the target task may come from auxiliary information, such as user interaction information in IoT systems with human-computer interaction. To address the mismatch between data labels and target tasks in IoT systems, this dissertation proposes a key activity detection scheme based on coarse-grained labels and raw sensor data. The goal is to determine the exact start and end times of key activities using only coarse-grained labels and raw sensor data. A siamese key activity attention network is designed to learn associations between the temporal features of sensor data and the discriminative task. By interpreting the inference behavior of the deep model, the framework can locate the region where the target key activity occurs. Furthermore, based on the proposed approach, a user-interaction-friendly sensor data annotation system is designed and evaluated in this dissertation, which supports large-scale sensor data annotation. |