Research On Text Representation Model And Deep Learning Algorithm In Text Classification

Posted on:2020-04-30

Degree:Master

Type:Thesis

Country:China

Candidate:T F Li

Full Text:PDF

GTID:2428330575492707

Subject:Control theory and control engineering

Abstract/Summary:

With the rapid development of the Internet, the volume of online text has grown explosively. Managing and classifying this data manually costs a great deal of labor and time and is difficult to achieve at scale. How to organize and manage text efficiently has therefore become a hot topic in natural language processing, which in turn has driven the rapid development of automatic text classification. The technology is now widely used in text mining, information filtering, and information retrieval. Automatic text classification draws on many fields, including machine learning, optimization theory, and natural language processing, so its performance depends on many factors: text preprocessing, the choice of text representation model, the feature dimensionality reduction algorithm, the design of the text classifier, and so on. Among these, the text representation model and the classifier design are two research hotspots. This thesis first discusses the research background and significance of text classification, analyzes research trends and hotspots in China and abroad, and describes how each stage of the text classification process is implemented. On this basis, it studies feature dimensionality reduction, text representation, and the application of deep learning to text classification, with the following results:

(1) A feature clustering algorithm based on a neural network language model, NNLM-FC, is proposed. The traditional vector space model loses the semantics of words, has excessively high dimensionality, and leaves a large number of synonyms and near-synonyms in the feature set. To address these problems, a neural network language model is used to transform the feature words into low-dimensional semantic vectors, and the K-means algorithm clusters semantically similar feature words. The chi-square statistic of each feature word is then computed, and the feature words with large chi-square statistics in each cluster are selected for text representation. The algorithm is evaluated with naive Bayes, support vector machine, and K-nearest neighbor classifiers on the Fudan University corpus and a web-crawled dataset. Using the accuracy and value of the classification results as metrics, it is compared comprehensively with common feature selection algorithms. The experimental results show that the proposed algorithm not only effectively reduces the dimensionality of the vector space but also improves text classification performance.

(2) A deep learning text classification model based on weighted word vectors is proposed. Traditional deep learning models cannot distinguish the importance of individual word vectors well, and the CNN model discards many useful features and is poorly suited to sequential text. To address these problems, a new feature weighting method (TDC) is proposed: the word vectors are weighted and feature words of low importance are removed, which reduces the dimensionality of the deep learning input matrix. The CNN model is then combined with the LSTM model: the CNN extracts rich features from the text, the LSTM contributes its strength in processing sequence data, and the weighted word vectors serve as input. The result is the W-CNN-LSTM model. Experiments on the Stanford Sentiment Treebank and Movie Reviews datasets show that W-CNN-LSTM outperforms traditional deep learning models in classification performance.
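The NNLM-FC procedure outlined in result (1) can be illustrated with a minimal sketch: cluster feature words by their vectors with K-means, score each word with the chi-square statistic, and keep the highest-scoring word in each cluster. Everything concrete here is an assumption made for illustration and not taken from the thesis: the two-dimensional toy vectors stand in for neural network language model embeddings, the farthest-first initialisation and the "maximum over classes" chi-square score are common choices that the abstract does not specify, and the names `kmeans`, `chi_square`, and `nnlm_fc_select` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Lloyd's k-means; returns one cluster index per input vector."""
    # Assumption: farthest-first initialisation, used to keep the sketch deterministic.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    assignment = []
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        assignment = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                      for v in vectors]
        # Update step: each non-empty cluster's centroid becomes its mean.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def chi_square(docs, doc_labels, term):
    """Largest 2x2 chi-square statistic of `term` over all classes (an assumption)."""
    n = len(docs)
    best = 0.0
    for cls in set(doc_labels):
        pairs = list(zip(docs, doc_labels))
        n11 = sum(1 for d, y in pairs if term in d and y == cls)
        n10 = sum(1 for d, y in pairs if term in d and y != cls)
        n01 = sum(1 for d, y in pairs if term not in d and y == cls)
        n00 = sum(1 for d, y in pairs if term not in d and y != cls)
        denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
        if denom:
            best = max(best, n * (n11 * n00 - n10 * n01) ** 2 / denom)
    return best


def nnlm_fc_select(word_vectors, docs, doc_labels, k, per_cluster=1):
    """Keep the `per_cluster` highest chi-square words from each K-means cluster."""
    terms = sorted(word_vectors)
    clusters = kmeans([word_vectors[t] for t in terms], k)
    scores = {t: chi_square(docs, doc_labels, t) for t in terms}
    selected = []
    for c in range(k):
        members = [t for t, a in zip(terms, clusters) if a == c]
        members.sort(key=lambda t: scores[t], reverse=True)
        selected.extend(members[:per_cluster])
    return sorted(selected)


# Toy data (hypothetical): 2-D stand-ins for learned word embeddings.
word_vectors = {
    "good": (1.0, 0.1), "great": (0.9, 0.2),
    "bad": (-1.0, 0.1), "awful": (-0.9, 0.0),
}
docs = [{"good", "great"}, {"good"}, {"good"},
        {"bad", "awful"}, {"bad"}, {"bad"}]
doc_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

selected = nnlm_fc_select(word_vectors, docs, doc_labels, k=2)
print(selected)  # ['bad', 'good']
```

In this toy run the near-synonyms "good"/"great" and "bad"/"awful" fall into the same clusters, and only the more class-discriminative word of each pair survives, which is the dimensionality-reduction effect the abstract describes. A practical pipeline would replace the toy vectors with trained embeddings and would likely use a library K-means.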

Keywords/Search Tags:

Text Classification, Neural Network Language Model, Feature Clustering, Feature Weight, Deep Learning

Related items

1	Research On Text Sentiment Classification Based On Language Model And Machine Learning
2	Research On Text Clustering Algorithm Based On Deep Learning Feature Extraction
3	Research On Network Text Sentiment Classification Based On Deep Learning
4	Short Text Classification Algorithm Of Deep-learning Based On Feature Extension
5	Research On Text Classification Method Based On Machine Learning
6	Research On Application Of Deep Convolutional Neural Network Models For Feature Extraction And Classification
7	A Research Of Text Sentiment Classification Based On Deep Learning
8	Research On Short Text Classification Based On Deep Neural Network
9	A Study On Optimization Of Text Clustering Based On Convolutional Neural Network
10	Research On Text Classification Based On Deep Neural Network