Research On Text Representation Model And Deep Learning Algorithm In Text Classification

Posted on: 2020-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: T F Li
GTID: 2428330575492707
Subject: Control theory and control engineering
Abstract/Summary:
With the rapid development of the Internet, the volume of Internet-based text has grown explosively. Managing and classifying such massive amounts of data manually costs a great deal of labor and time and is difficult to achieve in practice. How to organize and manage this text efficiently has therefore become a hot topic in natural language processing, which in turn has driven the rapid development of automatic text classification. Automatic text classification is now widely used in text mining, information filtering, and information retrieval. It draws on knowledge from many fields, such as machine learning, optimization theory, and natural language processing, so its performance is affected by many factors, including text preprocessing, the choice of text representation model, feature dimensionality reduction, and classifier design. Among these factors, the text representation model and the design of the text classifier are two research hotspots.

This thesis first discusses the research background and significance of text classification, analyzes research trends and hotspots in China and abroad, and describes how each stage of the text classification process is implemented. On this basis, it studies feature dimensionality reduction, text representation, and the application of deep learning to text classification, and obtains the following results:

(1) A feature clustering algorithm based on a neural network language model, NNLM-FC, is proposed. The traditional vector space model lacks word semantics, its dimensionality is too high, and its feature set contains a large number of synonyms. To address these problems, a neural network language model is used to transform the feature words into low-dimensional semantic vectors, and the K-means algorithm clusters semantically similar feature words. The chi-square statistic of each feature word is then computed, and the feature words with large chi-square statistics in each cluster are selected for text representation. Naive Bayes, support vector machine, and K-nearest neighbor classifiers are applied to the Fudan University corpus and a web-crawled dataset, and the algorithm is compared comprehensively with common feature selection algorithms, using the accuracy and value of the classification results as metrics. The experimental results show that the proposed algorithm not only effectively reduces the dimensionality of the vector space but also improves text classification performance.

(2) A deep learning text classification model based on weighted word vectors is proposed. Traditional deep learning models cannot distinguish the importance of individual word vectors well, and the CNN model discards many useful features and is not well suited to sequential text. To address these problems, a new feature weighting method (TDC) is proposed: the word vectors are weighted and feature words of low importance are removed, which reduces the dimension of the input matrix of the deep learning model. The CNN model is then combined with the LSTM model. The CNN extracts rich features from the text, the LSTM contributes its strength in processing sequence data, and the weighted word vectors serve as input, which yields the W-CNN-LSTM model. Experiments on the Stanford Sentiment Treebank and Movie Reviews datasets show that the W-CNN-LSTM model outperforms traditional deep learning models in classification performance.
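
The selection step described for NNLM-FC in result (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the two-dimensional word vectors are hand-made stand-ins for the output of a neural network language model, the four-document corpus is invented, and the function names (kmeans, chi_square_scores, select_features) and the farthest-point initialization are assumptions chosen to keep the example deterministic. The chi-square score is the common 2x2 contingency form, taking the maximum over classes.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain K-means with farthest-point initialization (an assumption, for determinism)."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centers[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def chi_square_scores(presence, y):
    """Per-term chi-square statistic (max over classes) from a 0/1 document-term matrix."""
    n = presence.shape[0]
    scores = np.zeros(presence.shape[1])
    for c in np.unique(y):
        in_c = y == c
        A = presence[in_c].sum(axis=0).astype(float)   # docs in class c containing the term
        B = presence[~in_c].sum(axis=0).astype(float)  # docs outside c containing the term
        C = in_c.sum() - A                             # docs in c without the term
        D = (~in_c).sum() - B                          # docs outside c without the term
        denom = (A + C) * (B + D) * (A + B) * (C + D)
        chi2 = np.divide(n * (A * D - B * C) ** 2, denom,
                         out=np.zeros_like(denom), where=denom > 0)
        scores = np.maximum(scores, chi2)
    return scores

def select_features(word_vectors, presence, y, k, per_cluster=1):
    """Cluster the feature words, then keep the highest chi-square words of each cluster."""
    labels = kmeans(word_vectors, k)
    scores = chi_square_scores(presence, y)
    selected = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        best = members[np.argsort(scores[members])[::-1][:per_cluster]]
        selected.extend(best.tolist())
    return sorted(selected)

# Toy vocabulary with hand-made vectors (stand-ins for NNLM embeddings).
vocab = ["good", "great", "bad", "awful", "movie", "film"]
word_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0],
                         [-0.9, -0.1], [0.0, 5.0], [0.1, 5.1]])
# Rows are documents, columns follow vocab; y holds invented class labels.
presence = np.array([[1, 1, 0, 0, 1, 0],
                     [1, 0, 0, 0, 0, 1],
                     [0, 0, 1, 1, 1, 0],
                     [0, 0, 1, 0, 0, 1]])
y = np.array([1, 1, 0, 0])

selected = select_features(word_vectors, presence, y, k=3)
print([vocab[i] for i in selected])
```

In this toy setup each cluster of similar words contributes one representative, so the six-word vocabulary shrinks to three features. How many words to keep per cluster and how many clusters to use are tuning choices that the abstract does not specify.
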
Keywords/Search Tags: Text Classification, Neural Network Language Model, Feature Clustering, Feature Weight, Deep Learning