
Research On Short Text Classification Based On Deep Neural Network

Posted on: 2022-05-11
Degree: Master
Type: Thesis
Country: China
Candidate: S B Wang
Full Text: PDF
GTID: 2518306509988969
Subject: Applied Statistics

Abstract/Summary:
Internet technology has developed rapidly in recent years. Major internet platforms such as Alibaba, Meituan, TikTok, and Toutiao generate massive amounts of data every day, and text data is an important part of it. With the help of big data analysis technology, mining the valuable information contained in this data not only brings large profits to companies but also plays an important role in fields such as social governance and national security. Text classification extracts the content features of a document through a model and automatically assigns the document to predefined categories. Because text data is unstructured, how to quantify it, extract its features, and classify it accurately has become one of the most basic tasks in artificial intelligence. With the rapid development of deep learning, deep neural network models have shown very good performance in natural language processing. For text classification, this paper studies text representation, feature extraction, and the selection and integration of classification models, and tries to improve several of these steps.

For representing unstructured text, common methods include the bag-of-words model (BOW), term frequency (TF), and inverse document frequency (IDF). These methods can represent the relationship between a word and the whole text, but they lose the information carried by the word itself. The vector space model (VSM) works well for long texts, but the sparsity and irregularity of short texts make its performance unsatisfactory. Compared with VSM, Word2Vec solves the high-dimensionality and sparsity problems of traditional text representation models, but it does not reflect the importance of feature words in the text. TF-IDF emphasizes the importance of low-frequency words in the whole document collection, but it does not account for the uneven distribution of feature words between classes, nor for their uneven distribution among different documents within the same class, which adversely affects classification results. This paper therefore introduces a class weighting factor and an intra-class weighting factor into TF-IDF to address these problems, and forms a new text representation method by combining the Word2Vec model with the improved TF-IDF weights.
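The abstract does not give the exact formulas for the class and intra-class weighting factors, so the sketch below is only a minimal illustration of the idea: each word's Word2Vec vector is weighted by a TF-IDF score multiplied by two assumed factors (one rewarding terms concentrated in the document's class, one rewarding terms spread evenly across that class), and the weighted vectors are averaged into a document vector. The names `doc_vector`, `word_vectors`, `class_factor`, and `intra_factor` are illustrative, not taken from the thesis.

```python
import math
import numpy as np

def doc_vector(doc, doc_class, docs, classes, word_vectors, dim=100):
    """Weighted-average Word2Vec document embedding.

    `doc` is a token list, `docs`/`classes` the training corpus and its labels,
    and `word_vectors` maps tokens to pre-trained Word2Vec vectors of size `dim`.
    The two extra factors are assumptions standing in for the thesis's
    class and intra-class weighting factors.
    """
    n_docs = len(docs)
    same_class = [d for d, c in zip(docs, classes) if c == doc_class]
    vec, total_w = np.zeros(dim), 0.0

    for term in set(doc):
        if term not in word_vectors:
            continue
        # Standard (smoothed) TF-IDF.
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in docs if term in d)
        idf = math.log((1.0 + n_docs) / (1.0 + df)) + 1.0

        # Class weighting factor: share of the term's documents that belong
        # to this document's class, so class-discriminative terms gain weight.
        df_class = sum(1 for d in same_class if term in d)
        class_factor = df_class / (1.0 + df)

        # Intra-class weighting factor: how widely the term is spread over
        # documents of the same class; evenly spread terms gain weight.
        intra_factor = df_class / max(1, len(same_class))

        w = tf * idf * class_factor * intra_factor
        vec += w * word_vectors[term]
        total_w += w

    return vec / total_w if total_w > 0 else vec
```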
Convolutional neural networks (CNN) and bidirectional long short-term memory networks (BiLSTM) are classic deep neural networks for feature extraction and have shown excellent performance in text processing and computer vision. However, CNN tends to extract local features of text and cannot capture contextual features well. LSTM, an improvement on the recurrent neural network (RNN), alleviates the vanishing and exploding gradient problems and has a memory mechanism, but its design can only extract features from the preceding context, ignoring the following context. BiLSTM combines a forward LSTM and a backward LSTM and can effectively extract global contextual features, but it is weaker at extracting local features. Therefore, this paper combines BiLSTM and CNN so that their strengths complement each other and the feature information of the text can be extracted more fully.

By improving the text representation and integrating the classification models, this paper constructs the TDFMIX model. Comparisons of different models on multiple corpora show that the TDFMIX model improves text classification performance.
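The abstract states that BiLSTM and CNN are combined so that global contextual and local features complement each other, but it does not specify the architecture. The PyTorch sketch below shows one plausible arrangement (embedding, then BiLSTM, then parallel convolutions with max-pooling feeding a linear classifier); the layer sizes, kernel widths, and the class name `BiLSTMCNNClassifier` are assumptions, not the thesis's actual TDFMIX configuration.

```python
import torch
import torch.nn as nn

class BiLSTMCNNClassifier(nn.Module):
    """Minimal BiLSTM + CNN text classifier (illustrative, not the TDFMIX spec)."""

    def __init__(self, vocab_size, embed_dim=100, hidden=128,
                 n_filters=100, kernel_sizes=(3, 4, 5), n_classes=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # BiLSTM extracts global, bidirectional context features.
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                              bidirectional=True)
        # Parallel 1-D convolutions extract local n-gram features
        # from the BiLSTM outputs.
        self.convs = nn.ModuleList(
            [nn.Conv1d(2 * hidden, n_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, token_ids):                     # (batch, seq_len)
        x = self.embedding(token_ids)                 # (batch, seq, embed)
        x, _ = self.bilstm(x)                         # (batch, seq, 2*hidden)
        x = x.transpose(1, 2)                         # (batch, 2*hidden, seq)
        # Max-pool each convolution's output over the sequence dimension.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))      # (batch, n_classes)


# Example: classify a batch of 8 padded sequences of length 50.
model = BiLSTMCNNClassifier(vocab_size=30000)
logits = model(torch.randint(0, 30000, (8, 50)))
```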
Keywords/Search Tags: Text Classification, Improved TF-IDF Model, Feature Extraction, Convolutional Neural Network, BiLSTM