
Research On Multi-label Text Classification With Long-tailed Distribution

Posted on: 2023-12-16  Degree: Doctor  Type: Dissertation
Country: China  Candidate: L Xiao  Full Text: PDF
GTID: 1528306845997179  Subject: Computer Science and Technology
Abstract/Summary:
With the explosive growth of big data, Multi-Label Text Classification (MLTC) has become significantly challenging because it must handle massive numbers of documents, words, and labels simultaneously. It is therefore urgent to develop effective multi-label text classifiers for practical applications. In traditional classification, each document belongs to exactly one label, which is known as multi-class learning. In real applications, however, many instances belong to multiple categories and carry multiple labels at the same time. For example, a sports news article might fall under the "Sports" category as well as the "Olympics", "Swimming", or "Diving" categories. Such instances should be classified with MLTC, which has many practical applications, such as topic recognition, sentiment analysis, and question answering systems.

However, MLTC suffers from highly skewed, long-tailed label distributions: a few labels are associated with a large number of documents (a.k.a. head labels), while a large fraction of labels are associated with only a small number of documents (a.k.a. tail labels). The relative infrequency of tail labels creates an imbalance that biases classifiers toward predicting head labels, making it difficult to capture information about tail labels. Because the numbers of instances in head and tail labels are imbalanced, an accurate classifier can be learned from the sufficient instances of head labels but not from the insufficient instances of tail labels. In addition, few-shot instances limit intra-class diversity, so the learned subspace for tail labels is distorted.

To tackle the difficulty of acquiring tail-label semantics, data imbalance, and the distortion of the tail-label feature space, we design several models from two perspectives: introducing external information (label semantics or data topology) and transferring internal information. For introducing external information, we propose a Label-Specific Attention Network (LSAN) that exploits label semantic information. Building on that, to tackle multi-label graph node classification, we further explore the topology of multi-label graph data and design a Label-Aware Representation Network (LARN), in which label semantics and data topology help tail labels capture relevant information and improve the effectiveness of multi-label classification. For transferring internal information, we design the Head-To-Tail Network (HTTN) and the Instance Correlation Transfer Network (ICTN), which address the imbalanced data distribution and the lack of intra-class diversity on tail labels. The main contributions are as follows:

1. Label-Specific Attention Network (LSAN): In text classification, each label carries textual information. In the MLTC task, one document may have multiple labels, and each label can be viewed as one aspect or component of the document. Motivated by these observations, we propose LSAN to learn document representations by fully exploiting both document content and label content. To capture the label-related component of each document, we adopt a self-attention mechanism that measures the contribution of each word to each label. Meanwhile, LSAN embeds each label into a vector in the same space as word embeddings, so that the semantic relations between document words and labels can be computed explicitly. An adaptive fusion strategy then extracts the proper amount of information from these two views and constructs the label-specific representation of each document.

2. Label-Aware Representation Network (LARN): For semi-supervised few-shot multi-label node classification, we propose LARN, which exploits the semantic knowledge of labels and the data topology to characterize nodes. It is a label-aware feature-learning process that allows a node to prepare its representation by knowing how it will be classified; the resulting rich representations combat the scarcity of labeled training nodes. A label correlation scanner is then proposed to adaptively capture label correlations and extract useful information to generate the final node representation.

3. Head-To-Tail Network (HTTN): To address the challenge of insufficient training data for tail-label classification, we propose HTTN, which transfers meta-knowledge from the data-rich head labels to the data-poor tail labels. The meta-knowledge is the mapping from few-shot network parameters to many-shot network parameters, which promotes the generalizability of tail classifiers. In addition, a triple-alliance prototype is proposed to learn better label prototypes for long-tailed multi-label text classification by adopting an attentive prototype aided by few-shot documents, label semantic information, and label correlation.

4. Instance Correlation Transfer Network (ICTN): It is much more challenging to mine hidden patterns from the data-poor tail labels than from the data-rich head labels, mainly because head labels usually carry abundant information, e.g., a large intra-class diversity, while tail labels do not. In response, we propose ICTN to augment tail-label documents and balance the tail and head labels. Two regularizers, diversity and consistency, are designed to constrain the generation process. The consistency regularizer encourages the variance of tail labels to approach that of head labels, further balancing the whole dataset in the high-level feature space, which benefits the subsequent learning task. The diversity regularizer ensures that the generated instances are diverse and avoids generating redundant instances.
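The LSAN idea above (score each word against each label embedding, then fuse the label-attention view with a self-attention view via a gate) can be sketched minimally in NumPy. This is an illustrative simplification, not the dissertation's implementation: the function names and the single sigmoid gate in `adaptive_fusion` are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def label_attention(words, labels):
    """Label-attention branch: score every word against every label
    embedding, then form one attention-weighted document vector per label."""
    # words: (n_words, d) word embeddings; labels: (n_labels, d) label embeddings
    scores = softmax(words @ labels.T, axis=0)  # (n_words, n_labels); each column sums to 1
    return scores.T @ words                     # (n_labels, d) label-specific document reps

def adaptive_fusion(self_rep, label_rep, gate_w):
    """Fuse the self-attention and label-attention views with a per-label
    sigmoid gate (a simplified stand-in for LSAN's adaptive fusion)."""
    alpha = 1.0 / (1.0 + np.exp(-(self_rep @ gate_w)))  # (n_labels,) gate in (0, 1)
    return alpha[:, None] * self_rep + (1.0 - alpha)[:, None] * label_rep
```

Because both word and label embeddings live in the same d-dimensional space, the word-label scores are plain dot products, which is what lets the semantic relation between words and labels be computed explicitly.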
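HTTN's meta-knowledge is described as a mapping from few-shot parameters to many-shot parameters, learned on head labels and applied to tail labels. A minimal sketch of that transfer pattern, using a least-squares linear map as a stand-in for the learned mapping (the function names and the linear form are assumptions, not the dissertation's architecture):

```python
import numpy as np

def few_shot_prototype(samples):
    """Mean of the few available instances -- a proxy few-shot classifier."""
    return samples.mean(axis=0)

def learn_meta_mapping(head_protos, head_weights):
    """Fit a linear map from few-shot prototypes to many-shot classifier
    weights on the data-rich head labels (least-squares stand-in for
    HTTN's learned meta-knowledge)."""
    # head_protos: (n_head, d); head_weights: (n_head, d)
    M, *_ = np.linalg.lstsq(head_protos, head_weights, rcond=None)
    return M  # (d, d)

def transfer_to_tail(tail_protos, M):
    """Apply the head-learned mapping to tail prototypes, yielding
    stronger classifier weights for the data-poor tail labels."""
    return tail_protos @ M
```

The point of the pattern is that the mapping is estimated only where supervision is plentiful (head labels) and then reused where it is scarce (tail labels).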
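The two ICTN regularizers can likewise be illustrated with simple NumPy loss terms. This is a hedged sketch of the stated objectives (match tail-label variance to head-label variance; keep generated instances dissimilar), not ICTN's actual formulation:

```python
import numpy as np

def consistency_reg(tail_feats, head_feats):
    """Push the per-dimension variance of (generated) tail-label features
    toward the head-label variance, balancing the high-level feature space."""
    return np.abs(tail_feats.var(axis=0) - head_feats.var(axis=0)).mean()

def diversity_reg(generated):
    """Mean pairwise cosine similarity among generated instances; penalizing
    it discourages redundant augmentations."""
    g = generated / np.linalg.norm(generated, axis=1, keepdims=True)
    sim = g @ g.T
    n = generated.shape[0]
    return (sim.sum() - n) / (n * (n - 1))  # off-diagonal mean
```

Minimizing `consistency_reg` plus `diversity_reg` (with suitable weights) during generation captures the described trade-off: the augmented tail distribution should spread like a head label, without collapsing onto duplicate instances.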
Keywords/Search Tags: Multi-label learning, text classification, long-tailed distribution