In the era of massive information, understanding information content has become increasingly important. One important approach is to label content with appropriate semantic tags: predicting semantic tags of web comments to filter harmful ones, annotating scientific literature with a set of keywords, and so on. Manual labeling is inefficient and uneconomical, so research on high-performance multi-label semantic indexing algorithms is of great significance. Most traditional multi-label text semantic indexing algorithms are based on statistical machine learning techniques. With the rapid development of deep learning in recent years, it has gradually become the best practice in natural language processing, so this paper focuses on text semantic indexing algorithms based on deep learning, organized as follows:

(1) For text semantic indexing problems with a small label space (a small number of candidate labels), this paper uses the classic binary relevance method to transform the multi-label problem into multiple single-label problems, with BERT, which has shown strong performance in natural language processing, as the base classifier in a transfer learning paradigm. The final prediction is obtained by aggregating the results of all base classifiers.

(2) For multi-label text semantic indexing problems with a large label space (a large range of candidate labels), the resource consumption of the binary relevance method is too high (a problem with q candidate labels requires training q classifiers, and all q classifiers must run simultaneously at inference time), and it cannot easily exploit the relationships between labels. This paper designs a weight-sharing neural network structure that predicts all labels at the same time, reducing computational resource consumption. In addition, a plug-and-play (independent of the specific network structure) multi-task learning structure is designed to exploit the relationships between labels efficiently.

(3) Facing the problem of imbalanced
training data, the common data-sampling-based methods are not suitable for a weight-sharing neural network that predicts all labels at once. This paper designs a simple and effective method to alleviate the impact of data imbalance: focal loss is used as the optimization target, and the classification threshold is dynamically adjusted according to the proportion of positive and negative examples of each category. With this method, the effect of data imbalance on algorithm performance can be alleviated to some extent.

(4) When the training data is very large, it may not fit into memory at one time, and the training speed of a single GPU, or even a single host, may not meet application requirements. To make the algorithms designed in this paper better suited to practical application scenarios, a highly scalable implementation is adopted, including a protocol buffers training data storage format, pipelined data loading and transformation, and a ring all-reduce-based distributed training process.

Reasonable comparative experiments were designed to verify the effectiveness of the designed algorithms, which achieved good performance on the experimental data sets: among kernels for the Kaggle Jigsaw toxic comment data set, the AUC-ROC metric ranked second and would place in the top 6% of the leaderboard. On the BioASQ Task 5A data set, micro precision was superior to all other solutions, and micro recall ranked third among the best solutions submitted by all participating teams.
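The binary relevance decomposition in point (1) can be sketched as follows. This is a minimal illustration of the transformation only: the trivial majority-class classifier below is a hypothetical stand-in for the fine-tuned BERT base classifier described in the text, and all names and shapes are illustrative.

```python
import numpy as np

def train_binary_relevance(X, Y, make_classifier):
    """Binary relevance: train one independent binary classifier per label."""
    return [make_classifier().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def predict_binary_relevance(classifiers, X):
    """Aggregate the q per-label predictions into one (n_samples, q) matrix."""
    return np.stack([clf.predict(X) for clf in classifiers], axis=1)

# Trivial majority-class "classifier": a placeholder for the BERT base
# classifier, kept only so the sketch runs end to end.
class MajorityClassifier:
    def fit(self, X, y):
        self.label_ = int(np.mean(y) >= 0.5)
        return self

    def predict(self, X):
        return np.full(len(X), self.label_)

X = np.zeros((4, 2))                              # dummy features
Y = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])    # q = 2 candidate labels
clfs = train_binary_relevance(X, Y, MajorityClassifier)
preds = predict_binary_relevance(clfs, X)         # shape (4, 2)
```

Each of the q classifiers sees only its own column of Y, which is exactly why the cost grows linearly with the label space and why label relationships are lost.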
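The weight-sharing idea in point (2) amounts to one network with a shared representation feeding a q-unit sigmoid output layer, so a single forward pass scores every label. The tanh hidden layer and all sizes below are illustrative assumptions, not the paper's actual architecture (which builds on a deep text encoder), and the plug-and-play multi-task structure is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SharedMultiLabelHead:
    """Shared hidden weights + one sigmoid unit per label: all q labels are
    predicted in one forward pass instead of running q separate classifiers."""

    def __init__(self, n_features, n_hidden, n_labels, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_features, n_hidden))  # shared by all labels
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_labels))    # one column per label

    def forward(self, X):
        h = np.tanh(X @ self.W1)      # shared representation of the text
        return sigmoid(h @ self.W2)   # independent probability for every label

model = SharedMultiLabelHead(n_features=8, n_hidden=4, n_labels=100)
probs = model.forward(np.zeros((2, 8)))   # one pass scores all 100 labels
```

Only the final weight matrix grows with the label space, which is what keeps inference cost nearly flat compared with q full classifiers.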
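The two ingredients of point (3) can be sketched directly. The focal loss below follows the standard formulation (Lin et al.); the threshold-scaling rule is a hypothetical illustration, since the text only states that thresholds are adjusted according to each category's positive/negative proportion.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma, which
    down-weights easy examples so rare positive labels are not drowned out."""
    p = np.clip(p, eps, 1.0 - eps)
    loss = (-y * (1.0 - p) ** gamma * np.log(p)
            - (1.0 - y) * p ** gamma * np.log(1.0 - p))
    return float(np.mean(loss))

def dynamic_thresholds(Y_train, base=0.5, floor=0.1):
    """Per-label decision thresholds scaled by each label's positive rate,
    so rarer labels get a lower threshold (hypothetical scaling rule)."""
    pos_rate = Y_train.mean(axis=0)
    return np.clip(base * pos_rate / 0.5, floor, base)
```

With gamma = 0 the focal loss reduces to plain cross-entropy, so gamma directly controls how strongly easy examples are suppressed.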
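The ring all-reduce step of point (4) can be illustrated with a single-process simulation of the communication pattern. This is a toy sketch of the algorithm only, not the paper's distributed implementation: each worker's gradient is split into k chunks, k-1 reduce-scatter steps accumulate partial sums around the ring, and k-1 all-gather steps circulate the finished chunks, so per-step traffic is independent of the number of workers.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring all-reduce summation of k workers' gradient vectors."""
    k = len(grads)
    chunks = [np.array_split(np.asarray(g, dtype=float).copy(), k) for g in grads]
    # Reduce-scatter: at step s, worker i forwards its chunk (i - s) mod k
    # to worker i + 1, which accumulates it into its own copy.
    for s in range(k - 1):
        for i in range(k):
            c = (i - s) % k
            chunks[(i + 1) % k][c] += chunks[i][c]
    # Now worker i holds the fully summed chunk (i + 1) mod k.
    # All-gather: circulate each finished chunk once around the ring.
    for s in range(k - 1):
        for i in range(k):
            c = (i + 1 - s) % k
            chunks[(i + 1) % k][c] = chunks[i][c].copy()
    return [np.concatenate(ch) for ch in chunks]
```

Dividing the result by k on each worker then yields the averaged gradient used for the synchronized update.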