Label ambiguity, which means there may be correlation between different labels, is a common problem in many natural language processing (NLP) tasks such as text classification, sentiment analysis, and named entity recognition. Label distribution learning is a new supervised learning paradigm for such problems, and it has achieved remarkable results in computer vision and biological information classification tasks. In the NLP community, however, there is little work investigating label distributions. NLP tasks differ from tasks in other fields: their instances carry rich semantic relations, which are widespread in the unlabeled text data that can be easily collected in the real world. In this paper, we propose four label distribution learning methods that fully extract semantic features from instances for NLP tasks. These methods express the semantic relationship between instances and labels by computing the relationship between instances and words and the relationship between labels and words. The first method is based on TF-IDF, a classical algorithm in the field of information retrieval. The second method is based on BM25, which improves on TF-IDF: because the correlation between a label and an instance grows sublinearly with word frequency (the relation is not linear), the second method introduces several parameters that impose a saturation limit on term frequency. It also accounts for factors the first method ignores, such as the number of words associated with an instance and the length of the instance. The third method improves on the second. The first two methods completely ignore the syntactic structure of natural language, whereas the vast unlabeled text data of the real world contain rich syntactic structure features; the third method therefore pre-trains word representations on a large unlabeled corpus, making full use of real-world corpora and greatly remedying the shortcomings of the first two methods. The fourth method introduces a thesaurus via a graph neural network (GNN) to enhance the word embeddings and generate the label distribution. Our research shows that the four methods proposed in this paper can effectively enhance the classification performance of models in natural language processing tasks. The effectiveness of the four methods is verified on text classification, sentiment analysis, and named entity recognition tasks, covering eight datasets. To explore the effectiveness of label distribution learning on imbalanced data, four imbalanced datasets are constructed manually and tested. The results show that the label distribution learning methods proposed in this paper improve models' accuracy on imbalanced datasets.
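The TF-IDF/BM25-style scoring behind the first two methods can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the smoothed IDF variant, and the default `k1`/`b` values are assumptions, but the sketch shows the two properties the text emphasizes — term-frequency saturation (via `k1`) and instance-length normalization (via `b`).

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one instance (doc_terms) against a set of query terms with BM25.

    k1 caps the contribution of term frequency (the sublinear growth /
    saturation effect); b controls normalization by instance length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        f = tf[term]
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

def bm25_label_distribution(doc_terms, label_words, corpus, eps=1e-9):
    """Turn per-label BM25 scores into a normalized label distribution.

    label_words: one list of associated words per label (hypothetical
    association; how labels map to words is method-specific).
    """
    scores = [bm25_score(words, doc_terms, corpus) for words in label_words]
    total = sum(scores) + eps * len(scores)
    return [(s + eps) / total for s in scores]
```

Doubling a term's frequency raises the score by less than a factor of two, which is the non-linear growth the second method is built around.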
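For the third method, label distributions are derived from pre-trained word representations rather than raw term counts. A minimal sketch, under the assumption that an instance is embedded as the mean of its word vectors and each label by a single vector, with cosine similarity turned into a distribution by a softmax — the paper's exact aggregation may differ:

```python
import numpy as np

def embedding_label_distribution(instance_vecs, label_vecs, tau=1.0):
    """Map an instance to a label distribution via embedding similarity.

    instance_vecs: (n_words, d) pre-trained word vectors of the instance
    label_vecs:    (n_labels, d) one vector per label (e.g. the label
                   name's pre-trained embedding)
    tau:           softmax temperature (assumed hyperparameter)
    Returns a probability vector over labels.
    """
    inst = instance_vecs.mean(axis=0)
    inst = inst / np.linalg.norm(inst)
    labels = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    sims = labels @ inst                # cosine similarities
    exps = np.exp(sims / tau)
    return exps / exps.sum()
```

Because the word vectors come from unsupervised pre-training on large corpora, similarities between an instance and label words reflect syntactic and semantic regularities that pure frequency counts cannot capture.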