
Based Multi-class Chinese Text Automatic Classification

Posted on: 2003-01-30
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Lu
Full Text: PDF
GTID: 2208360092499097
Subject: Information and Communication Engineering
Abstract/Summary:
With the spread of computer and Internet technology, the volume of data and information obtained through various channels is growing at a remarkable rate, and the tension between "abundant data" and "usable information" has become prominent. Finding and accurately locating useful information quickly and effectively, while discarding useless and irrelevant content, has become a bottleneck for knowledge acquisition and information filtering, which are mainstream technologies in the field of information development and processing.

This thesis discusses methods for the automatic classification of Chinese texts based on machine learning. The basic idea of machine learning here is to load human knowledge and methods, together with knowledge about the objects to be classified, into the computer, which then derives the recognition rules and analysis procedures. Automatic text classification applies those rules and procedures to an unclassified text in order to assign it to a class. The classifier is the core of the classification system, and it can be improved through further machine learning whenever necessary.

By examining the core technologies of automatic Chinese information processing, namely automatic word segmentation, feature selection and automatic text representation, the thesis improves on current methods of automatic word segmentation and term space reduction for Chinese texts, raising both their efficiency and their effectiveness. For text classification, the thesis introduces two supervised multi-class methods for the automatic classification of Chinese texts, fuzzy clustering and boosting, which address the problem of low recall.

After comparing the experimental results of the two methods, a multi-class automatic text classification system is built on the boosting method. It performed well in application and offers a good solution to the problem of real-time information classification.
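The preprocessing stages named in the abstract (word segmentation, term space reduction, text representation) can be illustrated with a minimal sketch. The abstract does not say which algorithms the thesis uses, so everything below is an assumption made for illustration: forward maximum matching stands in for the segmentation step, a document-frequency cutoff stands in for term space reduction, and the lexicon, the corpus and the names segment, reduce_terms and to_vector are all hypothetical.

```python
from collections import Counter

def segment(text, lexicon, max_len=4):
    """Forward maximum matching: take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

def reduce_terms(docs, min_df=2):
    """Keep terms that occur in at least min_df documents (document-frequency cut)."""
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, n in df.items() if n >= min_df)

def to_vector(doc, vocabulary):
    """Represent a segmented document as term counts over the reduced vocabulary."""
    counts = Counter(doc)
    return [counts[t] for t in vocabulary]

# Toy lexicon and corpus, invented for illustration only (not from the thesis).
LEXICON = {"文本", "分类", "自动", "机器", "学习"}
corpus = ["自动文本分类", "机器学习分类", "文本学习"]
docs = [segment(t, LEXICON) for t in corpus]
vocab = reduce_terms(docs)
vectors = [to_vector(d, vocab) for d in docs]
```

The resulting count vectors are the kind of input a supervised classifier, such as a boosted ensemble, would be trained on. The classifier itself is left out because the abstract gives no details of the thesis's boosting or fuzzy clustering setup.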
Keywords/Search Tags: multi-classification, machine learning, word segmentation, term space reduction (TSR), text representation, fuzzy clustering, classifier