
Research On Text Representation Method For News Classification

Posted on: 2021-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: R Li
Full Text: PDF
GTID: 2518306452473924
Subject: Control Engineering

Abstract/Summary:
Text representation is one of the core research directions of natural language processing. In recent years, with the development of machine learning, text representation has evolved from early statistical methods toward approaches that integrate machine learning. Human understanding of text relies on knowledge of the world and the brain's complex logical processing, whereas a computer relies on the binary encoding of text stored internally. How to express text in a form that computers can understand more readily has therefore become a popular research topic.

The core work of this thesis is to construct a text representation for Chinese text and apply it to news text classification. The main research questions are: how to bring the meanings and emotional content of Chinese characters into the text representation; how to bring the sequential structure of text into the text representation; and how to bring the context and semantics of text into the text representation. Based on these questions, the specific innovations and contributions of this thesis are as follows.

Firstly, a text representation method based on granularity fusion is proposed. Pretrained word vectors and randomly initialized character vectors are fused in the vector representation space: the character vectors reinforce the semantics and emotions within words, while the word vectors compensate for the inherent deficiencies of the character vectors. The proposed method is combined with TextCNN to build a model, and experiments on public datasets compare it with many other advanced algorithms. The results show that the granularity-fusion representation effectively combines the advantages of the character and word granularities and yields a more effective text representation.

Secondly, three techniques are introduced: position encoding, the multi-head self-attention mechanism, and the bidirectional long short-term memory (BiLSTM) network; the three are fused to construct a text sequence representation. Position encoding assigns positional information to the words in the text; the multi-head self-attention mechanism weights the degree of mutual attention between words; and the BiLSTM extracts the contextual and semantic information of the text. The fused representation is combined with the TextCNN model and evaluated on public datasets against many other advanced algorithms. The results show that the text sequence representation effectively captures the sequential information in text and strengthens the text representation.

Finally, a news classification model named GMSC-TextCNN is proposed, which combines the above text representation methods with TextCNN and applies them to short-text news headline classification on Toutiao (Jinri Toutiao). GMSC-TextCNN incorporates multiple kinds of information from the text to enhance the effective features in the text representation. Experiments show that GMSC-TextCNN greatly improves TextCNN's classification ability on short text.
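To make the granularity-fusion idea concrete, the following PyTorch code is a minimal sketch: pretrained word vectors and randomly initialized character vectors are concatenated token by token and fed to a TextCNN classifier. All vocabulary sizes, embedding dimensions, kernel sizes, and the one-character-per-word alignment are assumptions made for illustration, not the thesis's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityFusionTextCNN(nn.Module):
    # Illustrative sketch only: hyperparameters below are assumptions.
    def __init__(self, word_vocab=50000, char_vocab=6000,
                 word_dim=300, char_dim=100, num_classes=15,
                 kernel_sizes=(2, 3, 4), num_filters=100):
        super().__init__()
        # Word embeddings would be loaded from pretrained vectors in practice;
        # character embeddings are randomly initialized and learned.
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        fused_dim = word_dim + char_dim
        self.convs = nn.ModuleList(
            [nn.Conv1d(fused_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, word_ids, char_ids):
        # word_ids, char_ids: (batch, seq_len), aligned token by token
        # (a simplifying assumption for this sketch).
        fused = torch.cat([self.word_emb(word_ids),
                           self.char_emb(char_ids)], dim=-1)   # (B, T, D)
        fused = fused.transpose(1, 2)                          # (B, D, T)
        feats = []
        for conv in self.convs:
            h = F.relu(conv(fused))                            # (B, F, T-k+1)
            feats.append(F.max_pool1d(h, h.size(2)).squeeze(2))
        return self.fc(torch.cat(feats, dim=1))                # class logits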
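The sequence-representation stage can be sketched in the same spirit: fixed sinusoidal position encoding, multi-head self-attention, and a bidirectional LSTM are applied in turn, and their outputs are concatenated before the TextCNN stage. The hyperparameters and the fusion-by-concatenation choice here are assumptions for illustration, not necessarily the thesis's exact design.

import math
import torch
import torch.nn as nn

class SequenceRepresentation(nn.Module):
    # Illustrative sketch only: dimensions and fusion strategy are assumptions.
    def __init__(self, emb_dim=400, num_heads=8, lstm_hidden=200, max_len=512):
        super().__init__()
        # Fixed sinusoidal positional encoding.
        pe = torch.zeros(max_len, emb_dim)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, emb_dim, 2).float()
                        * (-math.log(10000.0) / emb_dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.attn = nn.MultiheadAttention(emb_dim, num_heads, batch_first=True)
        self.bilstm = nn.LSTM(emb_dim, lstm_hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, x):
        # x: (batch, seq_len, emb_dim) fused character/word embeddings.
        x = x + self.pe[: x.size(1)]              # add positional information
        attn_out, _ = self.attn(x, x, x)          # token-to-token attention weights
        ctx, _ = self.bilstm(attn_out)            # (B, T, 2 * lstm_hidden)
        # Concatenate attention and contextual features for the TextCNN stage.
        return torch.cat([attn_out, ctx], dim=-1)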
Keywords/Search Tags: natural language processing, text representation, granularity fusion, attention mechanism, contextual information