| With the continued development of the Internet, the amount of information on the network is growing explosively, and most of it exists as unstructured text. To use this text effectively, the key information in it must be annotated. Manual labeling is inefficient, tedious, and error-prone, and existing annotation tools offer only limited support for controlling data quality and reducing annotation workload. As a result, manual annotation is costly, and the quality of the labeled data is difficult to guarantee. Against this background, this paper makes two main contributions. (1) To address the shortcomings of existing annotation tools in ease of operation, annotation quality, and annotation efficiency, this paper designs and implements a convenient, easy-to-use text annotation platform. In addition to the basic functions of an annotation tool, the platform provides data quality analysis to control the quality of the annotated data, auxiliary annotation to reduce annotators' workload, and support for online model training. The paper compares the time needed to label the same data with and without the auxiliary labeling function; the experimental results show that the auxiliary labeling function effectively improves labeling efficiency. (2) Traditional annotation tools offer little support for choosing which texts to annotate first so as to improve model performance the most. To address this, the paper proposes an auxiliary annotation method based on active learning that measures the uncertainty of a text with respect to the model by the mean information entropy of the text. The higher a text's uncertainty, the greater the improvement in model performance that labeling it is expected to bring. The method first uses a deep learning model to predict character-level label probabilities for the unlabeled texts, then computes the mean information entropy of each unlabeled text from these probabilities, and finally sorts the unlabeled texts by this uncertainty so that users can label the most uncertain texts first. Comparative experiments on the Boson dataset demonstrate the effectiveness of the method: compared with other methods, selecting training samples by mean information entropy improves model performance more quickly and yields better model predictions. |
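
The selection step in contribution (2) can be illustrated with a minimal sketch. It assumes that a model has already produced, for every character of every unlabeled text, a probability distribution over labels. The function names, the text identifiers, and the probability values below are hypothetical, and the natural logarithm is an assumption, since the abstract does not state the base. The sketch is not the paper's implementation.

```python
import math
from typing import Dict, List, Sequence


def mean_entropy(char_probs: Sequence[Sequence[float]]) -> float:
    """Mean Shannon entropy (in nats) over the characters of one text.

    char_probs[i] is the predicted label distribution for character i.
    """
    if not char_probs:
        return 0.0
    total = 0.0
    for dist in char_probs:
        # Zero probabilities are skipped: p * log(p) tends to 0 as p -> 0.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(char_probs)


def rank_by_uncertainty(
    predictions: Dict[str, Sequence[Sequence[float]]]
) -> List[str]:
    """Return text ids sorted from most to least uncertain."""
    return sorted(
        predictions,
        key=lambda text_id: mean_entropy(predictions[text_id]),
        reverse=True,
    )


# Hypothetical model output: three short texts, two characters each,
# three candidate labels per character.
predictions = {
    "text_a": [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],  # confident
    "text_b": [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]],  # near uniform
    "text_c": [[0.70, 0.20, 0.10], [0.90, 0.05, 0.05]],  # in between
}

ranking = rank_by_uncertainty(predictions)
print(ranking)
```

In this toy input the near-uniform predictions of `text_b` place it at the head of the annotation queue, while the confidently predicted `text_a` comes last. Averaging over characters, rather than summing, keeps long texts from ranking high on length alone.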