| In supervised learning, data labels are crucial to the quality of a model. With the rapid development of artificial intelligence, the need for large-scale labeled data has grown accordingly. The traditional approach of relying on domain experts to label data is costly and does not scale to large datasets. With the rise of crowdsourcing, platforms can quickly and cheaply collect labels from multiple online users (crowd workers) for each instance, yielding crowdsourced data in which each instance has a multiple noisy label set. To train models on crowdsourced data, a common approach is label integration: inferring the true label of each instance from its multiple noisy label set and using the resulting integrated labels for supervised learning. Label integration therefore largely determines whether the collected data can be used effectively. In recent years, label integration has received much attention from scholars, and many label integration algorithms have been proposed. Although existing algorithms have improved the performance of label integration, further improvement is needed for real-world applications. Given the characteristics of crowdsourced data, performance can be improved from three aspects: estimating label quality, augmenting the number of labels, and utilizing the label distribution. Analyzing existing algorithms from these three aspects reveals that: 1) most of them focus on estimating label quality more accurately, and almost no effort has been made to augment the multiple noisy label sets; 2) only a few of them mine the label distribution information of instances, and the existing algorithm based on the margin between the positive and negative label probabilities in the label distribution, although effective for binary classification, cannot be applied to multi-class classification. To address the 
shortcomings of existing algorithms, this paper proposes two novel algorithms: Label Augmentation and Weighting-based Label Integration (LAWLI) and Between-class Margin-based Label Integration (BMLI). Extensive experiments on simulated and real crowdsourced datasets verify the effectiveness of both algorithms. The main contents and innovations of this paper are as follows: 1) We introduce the characteristics of crowdsourced data and, based on them, analyze the existing label integration algorithms in the domestic and international literature from three aspects: estimating label quality, augmenting the number of labels, and utilizing the label distribution. We find almost no prior work on augmenting the number of labels. The existing algorithms are therefore roughly divided into two major categories, those based on label quality and those based on label distribution, and the classical methods in each category are described in detail. 2) We propose a novel label integration algorithm: Label Augmentation and Weighting-based Label Integration (LAWLI). LAWLI augments each instance’s multiple noisy label set without increasing annotation cost, that is, without hiring more crowd workers. To this end, LAWLI uses the KNN algorithm to find the K nearest neighbors of each instance and adds their multiple noisy label sets to the instance’s own. Because some nearest neighbors may not belong to the same class as the instance, LAWLI weights the labels contributed by each neighbor according to the distance and the label similarity between the instance and that neighbor, which reduces the influence of such labels. After label augmentation and weighting, LAWLI obtains the integrated label of each instance by weighted majority voting. Extensive experiments on simulated and real crowdsourced datasets validate the effectiveness of LAWLI. 3) We propose a novel label integration algorithm: Between-class 
Margin-based Label Integration (BMLI). BMLI further mines the information in the label distribution in order to extend the label integration algorithm based on the margin between positive and negative label probabilities to multi-class classification. We first analyze why that algorithm cannot handle multi-class tasks and then propose BMLI. BMLI first calculates the label distribution of each instance from its multiple noisy label set and obtains an initial integrated label by majority voting. It then divides the crowdsourced dataset into a clean set and a noisy set using the margin between the largest and second-largest label probabilities in the label distribution. Finally, a classifier is trained on the clean set to predict the instances in the noisy set, and the predicted labels are used as the final integrated labels of those instances. Extensive experiments on simulated and real crowdsourced datasets validate the effectiveness of BMLI. |
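The label augmentation and weighting idea behind LAWLI can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not give the weighting formulas, so the distance weight `1 / (1 + d)`, the histogram-overlap `label_similarity`, the full weight given to an instance's own labels, and all function names and toy data below are assumptions made only for illustration.

```python
import math
from collections import defaultdict


def label_similarity(labels_a, labels_b):
    """Hypothetical similarity: overlap of the two normalized label histograms (0..1)."""
    classes = set(labels_a) | set(labels_b)
    return sum(
        min(labels_a.count(c) / len(labels_a), labels_b.count(c) / len(labels_b))
        for c in classes
    )


def lawli_sketch(features, label_sets, k=2):
    """Augment each instance's noisy labels with its k nearest neighbors' labels,
    weight the borrowed labels, and take a weighted majority vote."""
    n = len(features)
    integrated = []
    for i in range(n):
        # k nearest neighbors by Euclidean distance, excluding the instance itself.
        neighbors = sorted(
            (math.dist(features[i], features[j]), j) for j in range(n) if j != i
        )[:k]
        votes = defaultdict(float)
        # Assumption: the instance's own labels count with full weight 1.
        for lab in label_sets[i]:
            votes[lab] += 1.0
        # Borrowed labels are down-weighted by distance and by label dissimilarity.
        for d, j in neighbors:
            w = (1.0 / (1.0 + d)) * label_similarity(label_sets[i], label_sets[j])
            for lab in label_sets[j]:
                votes[lab] += w
        # Weighted majority vote; ties go to the smallest label.
        integrated.append(max(sorted(votes), key=votes.get))
    return integrated


# Toy data: two well-separated clusters, three noisy labels per instance.
features = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
label_sets = [[0, 0, 1], [0, 0, 0], [1, 0, 1], [1, 1, 1], [1, 1, 0], [1, 1, 1]]
integrated = lawli_sketch(features, label_sets, k=2)
print(integrated)
```

In this toy data, plain majority voting would assign label 1 to the third instance, whereas the labels borrowed from its two close neighbors shift the weighted vote to label 0.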
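The three BMLI steps (label distribution and initial majority vote, margin-based split, prediction of the noisy set from the clean set) can be sketched in the same way. This is an illustration under stated assumptions rather than the thesis implementation: the abstract does not say how the margin is turned into a split, so the fixed `threshold` is hypothetical, and a simple 1-nearest-neighbor rule stands in for the unspecified classifier. The sketch also assumes at least two classes.

```python
import math
from collections import Counter


def label_distribution(labels, classes):
    """Fraction of the noisy labels that fall in each class."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


def bmli_sketch(features, label_sets, classes, threshold=0.5):
    initial, clean, noisy = [], [], []
    for i, labels in enumerate(label_sets):
        dist = label_distribution(labels, classes)
        # Initial integrated label by majority voting (ties go to the first class).
        initial.append(classes[dist.index(max(dist))])
        # Margin between the largest and second-largest label probabilities.
        top1, top2 = sorted(dist, reverse=True)[:2]
        # Assumption: a fixed threshold on the margin separates clean from noisy.
        (clean if top1 - top2 >= threshold else noisy).append(i)
    final = list(initial)
    if clean:
        # Stand-in classifier: 1-nearest-neighbor over the clean set.
        for i in noisy:
            nearest = min(clean, key=lambda j: math.dist(features[i], features[j]))
            final[i] = initial[nearest]
    return final, clean, noisy


# Toy three-class data with four noisy labels per instance.
classes = [0, 1, 2]
features = [(0, 0), (0.2, 0), (5, 5), (5, 5.2), (10, 0), (10, 0.3)]
label_sets = [
    [0, 0, 0, 0], [0, 1, 2, 1], [1, 1, 1, 0],
    [2, 1, 2, 1], [2, 2, 2, 2], [2, 0, 0, 2],
]
final, clean, noisy = bmli_sketch(features, label_sets, classes)
print(final, clean, noisy)
```

Because the margin is taken between the two largest probabilities rather than between a fixed positive and negative class, the same split applies to any number of classes, which is the extension the abstract describes.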