Research On Text Categorization Algorithm Based On Category Homogenization

Posted on:2007-04-09

Degree:Master

Type:Thesis

Country:China

Candidate:W Zheng

Full Text:PDF

GTID:2178360182477783

Subject:Computer application technology

Abstract/Summary:

Automatic text classification is the task of assigning predefined category labels to documents. The vector space model (VSM) is widely used in large-scale text processing. To improve classification performance, this thesis studies and improves the classical algorithm for calculating term weights in the VSM. It also proposes a k-nearest neighbors (kNN) algorithm for text categorization based on category homogenization.

The improved term-weighting algorithm takes into account both the importance of a term for text categorization and the fact that terms at different positions in a text reflect its content to different degrees. Building on TF×IDF, the approach introduces information entropy and term position. This remedies two defects of TF×IDF: it neglects the distribution proportion of a term, and it neglects the position of the term in the text.

In the kNN algorithm based on category homogenization, the training set is reorganized with the category as the unit. Similar small categories, those smaller than a threshold, are merged into new large categories and become their sub-categories. Large categories, those bigger than the threshold, are split into several small categories, which become their sub-categories. The center of each sub-category serves as a representative point of the category. A document is categorized according to the sum of the cosine distances between it and the representative points of the categories. The distribution of categories is homogenized in the reorganized training set. Categorizing against this homogenized training set resolves the problems of multi-peak distributions, overlapping category boundaries, and the neglect of small categories.
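The categorization step described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and the abstract does not give the exact procedure, so several points are assumptions: the "cosine distance" is read as cosine similarity (larger means closer), only the k most similar representative points contribute to a category's sum, and the helper names (cosine, centroid, classify) and the toy vectors are hypothetical. Merging and splitting of categories by threshold is taken as already done.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Mean of a list of sparse vectors; stands in for one sub-category."""
    total = defaultdict(float)
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    n = len(vectors)
    return {t: w / n for t, w in total.items()}


def classify(doc, representatives, k=3):
    """Assumed decision rule: take the k representative points most similar
    to the document, sum the similarities per category, return the largest.

    representatives is a list of (category_label, centroid_vector) pairs,
    with one pair per sub-category.
    """
    scored = sorted(
        ((cosine(doc, c), label) for label, c in representatives),
        reverse=True,
    )
    totals = defaultdict(float)
    for sim, label in scored[:k]:
        totals[label] += sim
    return max(totals, key=totals.get)


# Toy data (hypothetical): "sports" has been split into two sub-categories,
# "finance" is represented by a single sub-category.
representatives = [
    ("sports", centroid([{"goal": 1.0, "match": 1.0}, {"goal": 1.0, "team": 1.0}])),
    ("sports", centroid([{"dunk": 1.0, "team": 1.0}])),
    ("finance", centroid([{"stock": 1.0, "market": 1.0}, {"stock": 1.0, "bank": 1.0}])),
]

print(classify({"team": 1.0, "goal": 1.0}, representatives, k=2))
```

Because every sub-category contributes exactly one point, a large category no longer outvotes a small one simply by having more training documents, which is the homogenizing effect the abstract refers to.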

Keywords/Search Tags:

text categorization, VSM, feature selection, term weight, category homogenization

Related items

1	The Method Of Text Categorization Scheme Selection And Development Of A Prototype System
2	Research And Application On Feature Selection Algorithms Based On Term Distributions In Text Categorization
3	The Research Of Text Representation And Feature Selection In Text Categorization
4	Research Of Feature Selection Based On Comprehensive Measure In Text Categorization
5	Feature Selection Methods For Text Categorization
6	Normal Weight Based Feature Selection Method In SVM Text Categorization
7	Research On High-Performance Feature Selection And Text Categorization
8	Research On Text Categorization And Technologies
9	Design And Implementation Of Kazak Text Categorization System
10	Research On Chinese Text Categorization Algorithms Based On Technology Text