Chinese Text Categorization Method And Implementation

Posted on: 2017-11-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhou
Full Text: PDF
GTID: 2347330503990901
Subject: Applied Statistics
Abstract/Summary:
The popularity of Internet applications has produced large amounts of unstructured text data, and automatic classification systems built on such data have shown great value. The K-nearest neighbor classification algorithm is simple and intuitive, and Naive Bayes classification, which is based on Bayes' theorem, performs well in text classification, so both have broad prospects for use in text categorization.

This paper first briefly introduces word segmentation, including mechanical segmentation, statistical segmentation, and related methods. Second, it uses the vector space model to represent the segmentation results as vectors, with feature weights calculated in a variety of ways, including Boolean weights, frequency weights, and inverse document frequency weights. To address the high-dimensionality problem in text mining, the paper uses the CHI (chi-square) statistic and the random forest Boruta algorithm for feature selection. The CHI statistic provides a test of the association between features and categories. A random forest combines the results of a series of simple classification trees and votes on a final category, and the Boruta algorithm adds shadow features to test the importance of the original variables. Finally, the paper compares the accuracy of the text classification results obtained with the instance-based K-nearest neighbor algorithm and the model-based Naive Bayes algorithm.

The paper focuses on three aspects. First, it uses the Boruta algorithm to address the problem of high dimensionality and, by comparison with the CHI method, shows that the feature selection algorithm can reduce the dimensionality of large data sets. Second, because an undetermined number of neighbors leads to poor classification performance, it provides an effective search range for K. Third, for the problem of zero posterior probabilities, it investigates the m-estimate, which improves the performance of the Naive Bayes classifier.
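The zero-probability problem and the m-estimate mentioned above can be illustrated with a minimal sketch. It assumes the standard m-estimate form, (n_wc + m * p) / (n_c + m), where n_wc is the count of a word in a class, n_c is the total word count of that class, p is a prior estimate, and m is an equivalent sample size. The function name and the toy counts are hypothetical and are not taken from the thesis.

```python
def m_estimate(count_wc, count_c, m, prior):
    """Smoothed estimate of P(word | class): (n_wc + m * p) / (n_c + m)."""
    return (count_wc + m * prior) / (count_c + m)


# Toy counts, made up for illustration: the word never occurs in the class.
vocab_size = 4
n_wc, n_c = 0, 10

# Without smoothing the conditional probability is zero, which zeroes out
# the whole product of probabilities in Naive Bayes.
unsmoothed = n_wc / n_c

# With a uniform prior p = 1/|V| and m = |V|, the m-estimate reduces to
# Laplace (add-one) smoothing.
smoothed = m_estimate(n_wc, n_c, m=vocab_size, prior=1 / vocab_size)

print(unsmoothed, smoothed)
```

Setting m to 0 recovers the unsmoothed relative frequency, while larger values of m pull the estimate toward the prior.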
Keywords/Search Tags: Text Classification, Feature Selection, Random Forest, K-nearest Neighbor Classification, Naive Bayes Classification Algorithm