Chinese Text Categorization Method And Implementation

Posted on:2017-11-22

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zhou

Full Text:PDF

GTID:2347330503990901

Subject:Applied Statistics

Abstract/Summary:

The popularity of Internet applications has produced large amounts of unstructured text data, which makes automatic text classification systems highly valuable. The K-nearest neighbor classification algorithm is simple and intuitive, and Naive Bayes classification, which is based on Bayes' theorem, performs well on text; both therefore have broad prospects in text categorization.

This thesis first briefly introduces word segmentation, including mechanical segmentation and statistical segmentation. Second, it uses the vector space model to represent the segmentation results as vectors, with feature weights calculated in a variety of ways, including Boolean weights, term frequency weights, and inverse document frequency weights. To address the high dimensionality encountered in text mining, it applies the CHI (chi-square) statistic and the random-forest-based Boruta algorithm for feature selection. The CHI statistic tests the association between a feature and the category. The Boruta algorithm combines the votes of a series of simple classification trees into a final classification, and it adds shadow features to test the importance of the original variables. Finally, the thesis compares the text classification accuracy of the instance-based K-nearest neighbor algorithm and the model-based Naive Bayes algorithm.

The work focuses on three aspects. First, it uses the Boruta algorithm to address high dimensionality and, by comparison with the CHI method, shows that feature selection can reduce the dimensionality of large feature sets. Second, because an undetermined number of neighbors leads to poor classification performance, it provides an effective search range for K. Third, to handle zero posterior probabilities, it introduces the m-estimate, which improves the performance of the Naive Bayes classifier.
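The abstract does not give the formulas behind the CHI statistic or the m-estimate, so the following is only a minimal sketch of their common textbook forms, not the thesis's own implementation. The function names, argument names, and the 2x2 document-count layout are assumptions made for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square score of one term against one category (textbook 2x2 form).

    Assumed layout of document counts (not taken from the thesis):
      a: documents in the category that contain the term
      b: documents outside the category that contain the term
      c: documents in the category that lack the term
      d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A row or column of the table is empty, so the term carries no signal.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def m_estimate(count_in_class, class_total, prior, m):
    """m-estimate of P(term | class).

    `prior` is an assumed prior probability for the term (for example
    1 / vocabulary_size) and `m` is the equivalent sample size. With m > 0
    and prior > 0 the estimate is never zero, even for an unseen term.
    """
    return (count_in_class + m * prior) / (class_total + m)


if __name__ == "__main__":
    # A term that appears only inside the category scores high.
    print(chi_square(10, 0, 0, 10))
    # A term spread evenly across categories scores zero.
    print(chi_square(5, 5, 5, 5))
    # An unseen term still receives a small non-zero probability.
    print(m_estimate(0, 100, prior=1 / 1000, m=1000))
```

Setting `m` to the vocabulary size and `prior` to its reciprocal reduces the m-estimate to add-one (Laplace) smoothing, while `m = 0` recovers the raw relative frequency that causes the zero-probability problem.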

Keywords/Search Tags:

Text Classification, Feature Selection, Random Forest, K-nearest Neighbor Classification, Naive Bayes Classification Algorithm

Related items

1	Improved Naive Bayes Algorithm With Application To Text Classification
2	The Method Of Selecting Local Feature Words And Its Application In Text Classification
3	Feature Weighting Method For Binary Classification In Machine Learning
4	EEG Signal Classification Based On Iterative Random Forest Algorithm
5	Research On News Classification And Recommendation Method Of Taiyuan Education Bureau Government Affairs Big Data Platform
6	Several Classification Algorithms And Their Applications In Statistical Learning
7	Research On High Dimensional Imbalanced Data Classification Based On Random Forest
8	Research On Imbalanced News Text Mining Based On Improved Random Forest
9	Analysis And Research Based On Multivariate Statistics And Machine Learning
10	Statistical Classification Analysis For High-dimensional Data