
Research On Text Classification Method For Multi-label Data Based On Deep Learning

Posted on: 2022-06-16
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Chen
Full Text: PDF
GTID: 2518306524993359
Subject: Master of Engineering
Abstract/Summary:
With the continued development of society and the Internet, natural language processing (NLP) technologies such as machine translation, text matching, and text classification have been widely applied in practice and have achieved good results. As the Internet has grown, the field increasingly faces real-world data. Such data contain a lot of noise, each sample may be marked with a combination of several sub-labels, and the categories are numerous with unbalanced sample sizes. At the same time, society today needs web texts and online public opinion to be supervised, so it is necessary to be able to handle this kind of irregular, complex data. This thesis studies multi-class text classification of multi-label data and explores how to process such complex data at each stage of the NLP pipeline. The main research content consists of three parts: data preprocessing, text representation, and the classification model.

(1) Data preprocessing consists of two parts: data cleaning and data balancing. Data cleaning mainly deletes useless symbols in web text, such as HTML markup. Unlike traditional NLP tasks, the cleaning process in this thesis keeps the punctuation marks in the text as part of the semantic expression. For data balancing, this thesis takes into account the characteristics of Chinese text, of the data set, and of up-sampling and down-sampling, and combines random order exchange, up-sampling, down-sampling, and other methods to balance the large and small categories.

(2) Converting input text into a matrix of real numbers that a computer can process is an indispensable part of NLP. In view of the hardware limitations that may exist in the supervision of online public opinion, this thesis explores text representation for web text using both the traditional embedding vector model and the BERT language model. Considering the characteristics of web text and the respective advantages and disadvantages of character vectors and word vectors, and based on the idea of joint character-word training, this thesis proposes a word embedding vector model, aw-char2vec. In addition, this thesis uses the BERT model as a text representation that feeds into the classification network, which achieves better results than using BERT directly as the classification model.

(3) Web text contains many meaningless words. To make the text classification model work better on such text, this thesis adds an attention mechanism to the RCNN, so that when the model encodes the contextual meaning of the text it pays more attention to the parts that carry more meaning. In addition, according to the characteristics of the data set used in this thesis, a voting-like classifier is designed by combining the Label Powerset and binary relevance methods, and the relationships between labels are incorporated into the training of the model. Good results are achieved on the Weibo data set and the Baidu question bank data set.
Keywords/Search Tags:NLP, Attention, Multi-Label, Unbalance, Web text
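
To make the two label transformations named in part (3) concrete, the following is a minimal sketch in plain Python. It is not the thesis's implementation: the function names, the toy labels, and in particular the `vote` rule (averaging a binary relevance score with the label marginal implied by a Label Powerset distribution) are assumptions chosen only for illustration, since the abstract does not specify how the two methods are combined.

```python
def to_binary_relevance(label_sets, all_labels):
    """Binary relevance: one independent 0/1 target per label."""
    return [[1 if lab in s else 0 for lab in all_labels] for s in label_sets]


def to_label_powerset(label_sets):
    """Label Powerset: each distinct label combination becomes one class id."""
    class_of = {}
    y = []
    for s in label_sets:
        key = frozenset(s)
        if key not in class_of:
            class_of[key] = len(class_of)
        y.append(class_of[key])
    return y, class_of


def vote(br_scores, lp_scores, all_labels, threshold=0.5):
    """Hypothetical voting rule (an assumption, not taken from the thesis).

    br_scores: per-label scores from binary relevance, aligned with all_labels.
    lp_scores: dict mapping a frozenset of labels to its predicted probability.
    A label is kept when the mean of its binary relevance score and its
    Label Powerset marginal reaches the threshold.
    """
    decided = set()
    for i, lab in enumerate(all_labels):
        lp_marginal = sum(p for combo, p in lp_scores.items() if lab in combo)
        if (br_scores[i] + lp_marginal) / 2 >= threshold:
            decided.add(lab)
    return decided


# Toy data: four samples over two illustrative labels.
all_labels = ["finance", "sports"]
label_sets = [{"sports"}, {"sports", "finance"}, {"finance"}, {"sports", "finance"}]

br_targets = to_binary_relevance(label_sets, all_labels)
lp_targets, class_of = to_label_powerset(label_sets)

# Made-up model outputs for one new sample.
br_scores = [0.4, 0.9]
lp_scores = {
    frozenset({"sports"}): 0.5,
    frozenset({"sports", "finance"}): 0.4,
    frozenset({"finance"}): 0.1,
}
predicted = vote(br_scores, lp_scores, all_labels)
```

Binary relevance treats each label independently, while Label Powerset preserves co-occurrence between labels at the cost of one class per observed combination. Combining the two is one way to bring label relationships into the prediction.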