
Research On Multi-label Text Classification With Long-tailed Distribution

Posted on: 2023-12-16  Degree: Doctor  Type: Dissertation
Country: China  Candidate: L Xiao  Full Text: PDF
GTID: 1528306845997179  Subject: Computer Science and Technology
Abstract/Summary:
With the explosive growth of big data, Multi-Label Text Classification (MLTC) has become significantly challenging because it must handle massive numbers of documents, words, and labels simultaneously. It is therefore urgent to develop effective multi-label text classifiers for practical applications. In traditional classification, each document belongs to exactly one label, which is known as multi-class learning. In real applications, however, many instances belong to multiple categories and carry multiple labels at the same time. For example, a sports news article might fall under the "Sports" category as well as the "Olympics", "Swimming", or "Diving" categories. Such instances should be classified with MLTC, which has many practical applications, such as topic recognition, sentiment analysis, and question answering systems.

However, MLTC suffers from highly skewed, long-tailed label distributions: a few labels are associated with a large number of documents (a.k.a. head labels), while a large fraction of labels are associated with only a small number of documents (a.k.a. tail labels). The relative infrequency of tail labels creates an imbalance that biases classifiers toward predicting head labels, making it difficult to capture information about tail labels. Because the numbers of instances in head and tail labels are imbalanced, an accurate classifier can be learned from the sufficient instances of head labels but not from the insufficient instances of tail labels. In addition, few-shot instances limit intra-class diversity, so the learned subspace for tail labels is distorted.

To tackle the difficulty of acquiring tail-label semantics, data imbalance, and the distortion of the tail-label feature space, we design several models from two perspectives: introducing external information (label semantics or data topology) and transferring internal information. For introducing external information, we propose a Label-Specific Attention Network (LSAN) that exploits label semantic information. Building on that, to tackle multi-label graph node classification, we further explore the topology of multi-label graph data and design a Label-Aware Representation Network (LARN), in which label semantics and data topology help tail labels capture relevant information and improve the effectiveness of multi-label classification. For transferring internal information, we design the Head-To-Tail Network (HTTN) and the Instance Correlation Transfer Network (ICTN), which address the imbalanced data distribution and the lack of intra-class diversity on tail labels. The main contributions are as follows:

1. Label-Specific Attention Network (LSAN): In text classification, each label carries textual information. In the MLTC task, one document may have multiple labels, and each label can be viewed as one aspect or component of the document. Motivated by these observations, we propose LSAN to learn document representations by fully exploiting both document content and label content. To capture the label-related component of each document, we adopt a self-attention mechanism that measures the contribution of each word to each label. Meanwhile, LSAN embeds each label into a vector in the same space as word embeddings, so that the semantic relations between document words and labels can be computed explicitly. An adaptive fusion strategy then extracts the proper amount of information from these two views and constructs the label-specific representation of each document.

2. Label-Aware Representation Network (LARN): For semi-supervised few-shot multi-label node classification, we propose LARN, which exploits the semantic knowledge of labels and the data topology to characterize nodes. It is a label-aware feature-learning process that allows a node to prepare its representation by knowing how it will be classified; the resulting rich representations combat the scarcity of labeled training nodes. A label correlation scanner is then proposed to adaptively capture label correlations and extract useful information to generate the final node representation.

3. Head-To-Tail Network (HTTN): To address the challenge of insufficient training data for tail-label classification, we propose HTTN, which transfers meta-knowledge from the data-rich head labels to the data-poor tail labels. The meta-knowledge is the mapping from few-shot network parameters to many-shot network parameters, which promotes the generalizability of tail classifiers. In addition, a triple-alliance prototype is proposed to learn better label prototypes for long-tailed multi-label text classification by adopting an attentive prototype aided by few-shot documents, label semantic information, and label correlation.

4. Instance Correlation Transfer Network (ICTN): It is much more challenging to mine hidden patterns from the data-poor tail labels than from the data-rich head labels, mainly because head labels usually carry abundant information, e.g., a large intra-class diversity, while tail labels do not. In response, we propose ICTN to augment tail-label documents and balance the tail and head labels. Two regularizers, diversity and consistency, are designed to constrain the generation process. The consistency regularizer encourages the variance of tail labels to approach that of head labels, further balancing the whole dataset in the high-level feature space, which benefits the subsequent learning task. The diversity regularizer ensures that the generated instances are diverse and avoids generating redundant instances.
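The LSAN idea above (score each word against each label embedding, then fuse the label-attention view with a self-attention view via a gate) can be sketched minimally in NumPy. This is an illustrative simplification, not the dissertation's implementation: the function names and the single sigmoid gate in `adaptive_fusion` are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def label_attention(words, labels):
    """Label-attention branch: score every word against every label
    embedding, then form one attention-weighted document vector per label."""
    # words: (n_words, d) word embeddings; labels: (n_labels, d) label embeddings
    scores = softmax(words @ labels.T, axis=0)  # (n_words, n_labels); each column sums to 1
    return scores.T @ words                     # (n_labels, d) label-specific document reps

def adaptive_fusion(self_rep, label_rep, gate_w):
    """Fuse the self-attention and label-attention views with a per-label
    sigmoid gate (a simplified stand-in for LSAN's adaptive fusion)."""
    alpha = 1.0 / (1.0 + np.exp(-(self_rep @ gate_w)))  # (n_labels,) gate in (0, 1)
    return alpha[:, None] * self_rep + (1.0 - alpha)[:, None] * label_rep
```

Because both word and label embeddings live in the same d-dimensional space, the word-label scores are plain dot products, which is what lets the semantic relation between words and labels be computed explicitly.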
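HTTN's meta-knowledge is described as a mapping from few-shot parameters to many-shot parameters, learned on head labels and applied to tail labels. A minimal sketch of that transfer pattern, using a least-squares linear map as a stand-in for the learned mapping (the function names and the linear form are assumptions, not the dissertation's architecture):

```python
import numpy as np

def few_shot_prototype(samples):
    """Mean of the few available instances -- a proxy few-shot classifier."""
    return samples.mean(axis=0)

def learn_meta_mapping(head_protos, head_weights):
    """Fit a linear map from few-shot prototypes to many-shot classifier
    weights on the data-rich head labels (least-squares stand-in for
    HTTN's learned meta-knowledge)."""
    # head_protos: (n_head, d); head_weights: (n_head, d)
    M, *_ = np.linalg.lstsq(head_protos, head_weights, rcond=None)
    return M  # (d, d)

def transfer_to_tail(tail_protos, M):
    """Apply the head-learned mapping to tail prototypes, yielding
    stronger classifier weights for the data-poor tail labels."""
    return tail_protos @ M
```

The point of the pattern is that the mapping is estimated only where supervision is plentiful (head labels) and then reused where it is scarce (tail labels).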
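The two ICTN regularizers can likewise be illustrated with simple NumPy loss terms. This is a hedged sketch of the stated objectives (match tail-label variance to head-label variance; keep generated instances dissimilar), not ICTN's actual formulation:

```python
import numpy as np

def consistency_reg(tail_feats, head_feats):
    """Push the per-dimension variance of (generated) tail-label features
    toward the head-label variance, balancing the high-level feature space."""
    return np.abs(tail_feats.var(axis=0) - head_feats.var(axis=0)).mean()

def diversity_reg(generated):
    """Mean pairwise cosine similarity among generated instances; penalizing
    it discourages redundant augmentations."""
    g = generated / np.linalg.norm(generated, axis=1, keepdims=True)
    sim = g @ g.T
    n = generated.shape[0]
    return (sim.sum() - n) / (n * (n - 1))  # off-diagonal mean
```

Minimizing `consistency_reg` plus `diversity_reg` (with suitable weights) during generation captures the described trade-off: the augmented tail distribution should spread like a head label, without collapsing onto duplicate instances.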
Keywords/Search Tags: Multi-label learning, text classification, long-tailed distribution