Chinese Text Classification Based On Statistical Method

Posted on: 2018-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yin
Full Text: PDF
GTID: 2347330518483225
Subject: Applied Statistics
Abstract/Summary:
Text classification assigns a text to a category by matching its features against a classification model. The process consists of representing the text, building the classifier, evaluating it, and predicting results. Several mature classification algorithms, such as naive Bayes and neural networks, work well on English text but sometimes perform poorly on Chinese text. The reason lies in the difference between English and Chinese words. In an English document, words are separated by spaces, so splitting the text into words is easy. In a Chinese document there are no delimiters between words, and different ways of combining characters into words can give very different meanings. The commonly used approaches to Chinese word segmentation are segmentation based on understanding, segmentation based on string matching, and segmentation based on statistical methods. This thesis studies the classification of Chinese text, tries several ideas for improving classification accuracy, and carries out the corresponding experiments. It uses 480 Chinese text documents downloaded from Sogou Laboratory. The categories of 400 of these documents are known (finance, health, education, and military); the other 80 documents are to be classified automatically by computer. The thesis explains the principles and characteristics of several classification algorithms, segments the Chinese documents with a statistical method, removes stop words, extracts features with TF-IDF, and classifies the documents according to those features. It applies the KNN algorithm, the SVM algorithm, and combination (ensemble) learning methods to the task and compares them directly. The classification models reach accuracy above 80%, and the random forest model, a combination learning method, achieves the highest accuracy, 92.5%.
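The pipeline described in the abstract (segmented text, stop-word removal, TF-IDF features, KNN classification) can be illustrated with a minimal sketch. It is not the thesis's implementation. The toy documents, the stop-word list, the smoothed IDF formula, the cosine similarity measure, and k = 3 are all assumptions made for illustration, and the documents are assumed to be already segmented into words. SVM and random forest are omitted to keep the example in pure Python.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (assumption; the thesis's list is not given).
STOP_WORDS = {"的", "了", "和"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def fit_idf(docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """L2-normalised sparse TF-IDF vector; terms unseen in training are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors (a dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def knn_predict(query, train_vecs, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(
        ((cosine(query, v), label) for v, label in zip(train_vecs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical pre-segmented training documents, two per category.
TRAIN = [
    (["股票", "市场", "上涨", "的", "银行"], "finance"),
    (["银行", "利率", "投资", "股票"], "finance"),
    (["医院", "医生", "疾病", "治疗"], "health"),
    (["健康", "饮食", "疾病", "预防"], "health"),
    (["学校", "学生", "考试", "教育"], "education"),
    (["教师", "课程", "学生", "大学"], "education"),
    (["军队", "武器", "演习", "导弹"], "military"),
    (["海军", "舰艇", "导弹", "军队"], "military"),
]

train_docs = [remove_stop_words(tokens) for tokens, _ in TRAIN]
train_labels = [label for _, label in TRAIN]
idf = fit_idf(train_docs)
train_vecs = [tfidf_vector(doc, idf) for doc in train_docs]

def classify(tokens, k=3):
    """Classify one pre-segmented document with the toy model above."""
    return knn_predict(tfidf_vector(remove_stop_words(tokens), idf), train_vecs, train_labels, k)

predicted = classify(["学生", "的", "考试", "课程"])
print(predicted)  # education
```

In this sketch the query shares terms only with the two education documents, so they dominate the vote. On a real corpus the choice of k and of the similarity measure would need to be tuned on held-out data.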
Keywords/Search Tags: text classification, Chinese word segmentation, TF-IDF, KNN, SVM, combination (ensemble) learning