| Automatic text classification is the task of assigning predefined category labels to documents. The vector space model (VSM) is widely used in large-scale text processing. To improve classification performance, this paper studies and improves the classical algorithm for calculating term weights in the VSM, and proposes a category-homogenizing k-nearest neighbors (kNN) algorithm for text categorization. The improved term-weighting algorithm considers both the importance of a term for categorization and how well terms at different positions in a text reflect its content. It builds on TF×IDF by introducing the information entropy and the position of terms, remedying TF×IDF's neglect of a term's distribution proportion and position information. In the category-homogenizing kNN algorithm, the training set is reorganized with the category as the unit: similar small categories below the threshold are merged into a new large category and become its sub-categories, while large categories above the threshold are split into several small categories that become their sub-categories. The center of each sub-category serves as a representative point of its category, and a document is categorized according to the sum of the cosine distances between it and these representative points. The reorganized training set has a homogenized category distribution, and classifying against it resolves the problems of multi-peak distribution, overlapping category boundaries, and the neglect of small categories. |
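
The classification step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`cosine_similarity`, `centroid`, `classify`), the toy vectors, and the reading of "sum of cosine distance" as a sum of cosine similarities that is maximized are all assumptions made for illustration. The merging and splitting of categories by threshold and the entropy- and position-based term weighting are not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of the document vectors in one sub-category."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(doc, category_centers):
    """Pick the category with the largest summed similarity to its centers.

    category_centers maps a category label to the list of its sub-category
    centers (hypothetical structure; the abstract does not specify one).
    """
    scores = {
        label: sum(cosine_similarity(doc, center) for center in centers)
        for label, centers in category_centers.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy data: one category with a single sub-category, one with two.
category_centers = {
    "sports": [centroid([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]])],
    "finance": [centroid([[0.0, 1.0, 0.0]]), centroid([[0.0, 0.5, 0.5]])],
}
label, scores = classify([0.9, 0.1, 0.0], category_centers)
print(label, scores)
```

With non-negative term weights, a plain sum tends to favor categories that have more sub-category centers, so this sketch is only meaningful when the categories have comparable numbers of sub-categories, which is presumably what homogenizing the training set is meant to ensure.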