Text classification is a fundamental task in natural language processing and is widely used in news recommendation, search engines, spam detection, sentiment analysis, and other applications. In recent years, researchers at home and abroad have achieved fruitful results in classifying long texts, but progress in classifying short texts has been slow. This thesis proposes two innovations that address two problems in short-text classification: the difficulty of classifying new words, or words that take on new meanings when placed in quotation marks (collectively called "noisy words" in this thesis), and the failure of text convolutional neural networks to capture features comprehensively in short texts. First, because news headlines are written to be readable, a headline contains other important words that help explain its noisy words. We therefore find the words that are most important to the classification result and use them to replace the noisy words, reducing the interference of the noisy words with classification. The TF-IDF algorithm measures the importance of words to classification results, but the traditional TF-IDF algorithm considers neither the distribution of feature words between classes nor their distribution across the texts within a class. This thesis therefore proposes the TF-IDF-BI algorithm. We define a "between-class factor" and an "intra-class factor" to measure the importance of a feature word to text classification between classes and within a class, respectively, derive the corresponding formulas, and incorporate both factors into the TF-IDF algorithm to form the TF-IDF-BI algorithm. Second, to address the problem that a traditional text convolutional neural network considers only local features and ignores global features in short headline text, this thesis proposes the Pre Info CNN neural network model on the basis of Text
CNN, which uses the output of a Long Short-Term Memory network at each time step to form a semantic information matrix of the preceding text. This semantic information matrix is then fused with the convolution result so that the model also captures the semantics of the preceding text during the convolution operation, and K-max pooling replaces max pooling so that the model retains more semantic features. Finally, the resulting features are passed to a softmax function to obtain the classification result. Experiments show that the improved TF-IDF-BI algorithm achieves a final classification accuracy 1.26% higher than the traditional TF-IDF algorithm, and that the Pre Info CNN model proposed in this thesis is 1.24% higher than the Text CNN model, verifying that the two innovations effectively improve text classification accuracy. Applied to a text classification system for news headlines, the improved word vectors and the Pre Info CNN model classify short headline texts more accurately and can better assist journalists in classifying news texts.
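
The limitation of traditional TF-IDF described in this abstract can be illustrated with a minimal sketch. It implements one common TF-IDF variant (relative term frequency times log inverse document frequency), not the TF-IDF-BI algorithm, whose between-class and intra-class formulas are not given here. The toy corpus, the class labels in the comments, and the function name `tf_idf` are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute plain TF-IDF weights for a list of tokenized documents.

    tf(t, d) = count of t in d / length of d
    idf(t)   = log(N / number of documents containing t)
    Class labels are never consulted, which is the limitation at issue.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append(
            {term: (c / len(tokens)) * idf[term] for term, c in counts.items()}
        )
    return weights, idf


# Hypothetical toy headlines: the first two are "sports", the last two "finance".
docs = [
    ["goal", "match", "report"],   # sports
    ["goal", "team", "win"],       # sports
    ["stock", "market", "report"], # finance
    ["bank", "rate", "win"],       # finance
]
weights, idf = tf_idf(docs)

# "goal" occurs only in sports headlines, "report" is spread over both classes,
# yet both appear in two documents and so receive the same IDF.
print(idf["goal"], idf["report"])
```

In this sketch a class-specific word and a class-neutral word are weighted identically, which is the kind of gap a between-class factor is meant to close.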
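
The difference between max pooling and K-max pooling mentioned in this abstract can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the Pre Info CNN implementation: the feature map is assumed to be a list of rows (one row per convolution filter, one value per position), and the function names `k_max_pooling` and `max_pooling` are hypothetical.

```python
def k_max_pooling(feature_map, k):
    """Keep the k largest activations of each filter, in their original order.

    feature_map: list of rows, one row per filter, one value per position.
    Returns a list of rows, each of length k.
    """
    pooled = []
    for row in feature_map:
        if k > len(row):
            raise ValueError("k must not exceed the sequence length")
        # Positions of the k largest values in this row.
        top_positions = sorted(
            range(len(row)), key=lambda i: row[i], reverse=True
        )[:k]
        # Re-sort positions so the kept values preserve their sequence order.
        pooled.append([row[i] for i in sorted(top_positions)])
    return pooled


def max_pooling(feature_map):
    """Ordinary max pooling: a single value per filter."""
    return [[max(row)] for row in feature_map]


# Two hypothetical filter responses over a four-token headline.
feature_map = [
    [1.0, 5.0, 2.0, 4.0],
    [9.0, 0.0, 3.0, 7.0],
]
print(max_pooling(feature_map))
print(k_max_pooling(feature_map, k=2))
```

With k = 2, each filter contributes two activations instead of one, so a second strong feature in the headline is retained rather than discarded.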