| With the growing volume of text data, text classification is becoming increasingly important. Automatic text classification techniques are needed to categorize text data and extract valuable information from it. In this research, we take Chinese news short text classification as a case study of Chinese short text classification. Short texts contain few words and have sparse features, so conventional text classification methods are poorly suited to them. To increase the number of features in short texts and improve classification accuracy, we present a short text feature extension method based on the Latent Dirichlet Allocation (LDA) model and the TextRank algorithm. The main work and research results are as follows: (1) We analyze in detail the advantages and disadvantages of the Naive Bayes algorithm, the Support Vector Machine (SVM), the K-Nearest Neighbor (KNN) algorithm, the Decision Tree algorithm, and the Logistic Regression algorithm. We apply text classification techniques for data cleaning, word segmentation, and feature processing of texts. We use these five machine learning algorithms to conduct Chinese news short text classification experiments and compare the classification results. (2) We present a short text feature extension method based on the LDA model and the TextRank algorithm to address the sparsity of short text features. We first use the LDA model to obtain the hidden topic features of each text, then use the TextRank algorithm to obtain the keywords of the text, and finally add the keywords corresponding to the text's hidden topic features to the short text as feature expansion words. This increases the number of features in short texts and supplies more useful information for subsequent classification. (3) Taking Chinese news short text classification as an example, we improve the text classification method through the extraction and expansion of feature words. We use the Naive Bayes, SVM, KNN, Decision Tree, and Logistic Regression algorithms to verify the improved method proposed in this thesis. We also use the Word2Vec model to conduct verification experiments on the THUCNews dataset to further confirm the effectiveness of the method. The results show that the method improves the accuracy of text classification. Expanding the features of short texts increases the number of text features and effectively alleviates the feature sparsity problem, which is significant for the correct classification of short texts. |
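
The feature extension procedure described in (2) can be illustrated with a minimal, standard-library-only sketch. It is not the thesis's implementation: the function names (`lda_gibbs`, `textrank`, `expand_short_texts`), the hyperparameters, and the toy English tokens are all assumptions made for illustration, and real input would be word-segmented Chinese news titles. The sketch also assumes one possible reading of the method: each text is assigned its dominant LDA topic, TextRank keywords are extracted from the texts sharing that topic, and those keywords are appended to the text as expansion words.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA; returns each document's dominant topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]                # count of topic k in doc d
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                              # tokens assigned to topic k

    def bump(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Random initial topic assignment for every token.
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assign[d]):
            bump(d, w, k, 1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, w, assign[d][i], -1)  # remove the token, then resample its topic
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                assign[d][i] = rng.choices(range(num_topics), weights=weights)[0]
                bump(d, w, assign[d][i], 1)
    return [max(range(num_topics), key=row.__getitem__) for row in doc_topic]


def textrank(token_lists, window=3, damping=0.85, iters=50, top_n=3):
    """PageRank over an undirected word co-occurrence graph; returns the top_n words."""
    neighbors = defaultdict(set)
    for tokens in token_lists:  # edges are only created inside a single text
        for i, w in enumerate(tokens):
            for u in tokens[i + 1:i + window]:
                if u != w:
                    neighbors[w].add(u)
                    neighbors[u].add(w)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - damping)
            + damping * sum(score[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(score, key=lambda w: (-score[w], w))[:top_n]


def expand_short_texts(docs, num_topics=2, top_n=3):
    """Append TextRank keywords of each text's dominant topic as expansion words."""
    topics = lda_gibbs(docs, num_topics)
    keywords = {
        k: textrank([doc for doc, t in zip(docs, topics) if t == k], top_n=top_n)
        for k in set(topics)
    }
    return [doc + [w for w in keywords[t] if w not in doc]
            for doc, t in zip(docs, topics)]


# Toy, pre-segmented "news titles" (hypothetical data, not from THUCNews).
docs = [
    ["team", "wins", "football", "match"],
    ["football", "league", "match", "tonight"],
    ["stock", "market", "falls", "sharply"],
    ["bank", "stock", "market", "rally"],
]
expanded = expand_short_texts(docs)
for original, longer in zip(docs, expanded):
    print(original, "->", longer)
```

The expanded token lists would then be vectorized and passed to whichever classifier is being evaluated. On a corpus this small the topic assignments are not meaningful; the sketch only shows how the three steps fit together.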