
Research On Text Classification And Short Text Clustering Technology Based On Contrastive Learning

Posted on: 2024-05-19
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Zhang
Full Text: PDF
GTID: 2568307094459434
Subject: Computer technology
Abstract/Summary:
With the arrival of the big data era, text classification and clustering techniques have become extremely important. The internet holds a large amount of textual information, such as Weibo comments, QQ chat messages, and movie reviews. Accurately classifying or clustering this information helps people manage and use text data more effectively. This thesis therefore studies a text classification method based on contrastive learning and adversarial training, as well as a short text clustering method based on contrastive learning. The specific research work is as follows:

(1) Although current text classification models achieve high accuracy, they still suffer from poor consistency under data augmentation and an inability to learn noise-invariant representations. As a result, the models resist perturbations poorly, generalize to a limited degree, and produce inconsistent prediction distributions for similar samples. These limitations severely affect the accuracy of text classification tasks. To address these issues, we propose a Text Classification model that combines Contrastive Learning and Adversarial Training (TCCA). The TCCA model first applies data augmentation to the original data to generate two samples, then adds adversarial perturbations to each of them to form a positive sample pair. The positive sample pairs are fed into a BERT model to extract text features, and a BiLSTM and an attention layer extract deeper semantic information. By fusing contrastive learning and adversarial training, we construct a new loss function to optimize the model. During the prediction phase, we adjust the model's predictions using the empirical distribution of the training set to improve classification accuracy. Compared with BERT, the TCCA model improves accuracy by 0.67%, 2.14%, and 1.77% on three datasets, respectively.

(2) Because short text data carries little information and has high category overlap, most short text clustering methods struggle to separate the data effectively. In addition, Transformer language representation models can suffer from representation degradation, which produces high cosine similarity between word vectors, impairs the semantic representation of text, and in turn hurts short text clustering performance. To address these issues, this thesis proposes a Short Text Clustering (STCL) model based on contrastive learning. The STCL model uses contrastive learning to separate data from different categories and, to some extent, alleviates the representation degradation problem, thereby improving short text clustering performance. Experimental results show that on most datasets the accuracy (Acc) and normalized mutual information (NMI) of STCL improve significantly, reaching 91.7% and 75.2%, respectively, on the AG News dataset. The ablation experiments also confirm the effectiveness of contrastive learning in short text clustering tasks.
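
The abstract does not state which contrastive objective TCCA or STCL uses. As an illustration only, the sketch below assumes the widely used NT-Xent (normalized temperature-scaled cross-entropy) loss over positive sample pairs. The function names, the temperature value, and the toy two-dimensional embeddings are hypothetical and are not taken from the thesis; in practice the embeddings would come from an encoder such as BERT.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss over a batch of positive pairs (assumed objective).

    view_a[i] and view_b[i] are embeddings of two augmented views of
    sample i; every other embedding in the batch acts as a negative.
    """
    embeddings = view_a + view_b          # 2N embeddings in one list
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)      # index of the paired view
        # Temperature-scaled similarities to every other embedding.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(2 * n)
            if j != i
        }
        log_denominator = math.log(sum(math.exp(s) for s in logits.values()))
        total += log_denominator - logits[positive]
    return total / (2 * n)


# Toy batch: two samples, each with two views.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matching_views = [[0.9, 0.1], [0.1, 0.9]]     # each view is close to its anchor
mismatched_views = [[0.1, 0.9], [0.9, 0.1]]   # views swapped between samples

print(nt_xent_loss(anchors, matching_views))
print(nt_xent_loss(anchors, mismatched_views))
```

The loss is lower when the two views of a sample lie close together and far from the other samples in the batch. This is the property that contrastive learning relies on to separate categories and to counteract uniformly high cosine similarity between representations.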
Keywords/Search Tags:Natural Language Processing, Text classification, Short text clustering, Contrastive learning, Adversarial training