
A Rough Set Approach To Chinese Documents Classification And Retrieval

Posted on: 2005-06-29; Degree: Master; Type: Thesis
Country: China; Candidate: X W Sheng; Full Text: PDF
GTID: 2155360152967857; Subject: Linguistics and Applied Linguistics
Abstract/Summary:
As we stride into the information age, we are confronted with vast amounts of digital data on the internet, most of which is text. A good automatic classification and retrieval system for text data is therefore required in order to fetch useful information, and automatic text classification (TC) and retrieval has become a hot spot in the natural language processing field.

Most of the term weighting algorithms used in TC only calculate term frequency, inverse document frequency and similar statistics. They ignore how a term is distributed among the different classes and how relevant this distribution is to the classification decision, which inevitably affects the performance of many recent TC systems.

Moreover, automatic text classification is inseparable from the construction of document vectors. With each term corresponding to one component of the vector, this method maps the input into a very high-dimensional space, possibly of ten thousand dimensions, which results in a massive amount of calculation. An effective reduction algorithm for document vectors is thus indispensable, yet traditional algorithms based on frequency and threshold filtering often lose much useful information.

By introducing a mathematical tool, rough set theory, this thesis presents a new classification system. Rough set theory can be efficient in both term weighting and dimensionality reduction. In addition, a basic retrieval function can also be realized using rough set theory. The experimental results show that weighting with rough sets can effectively separate terms according to their importance by calculating a disperse weight, and can avoid the side effect of the frequency factor. Rough set weighting is efficient for classification because it restrains unimportant high-frequency terms and at the same time raises important low-frequency terms.

Reduction with rough sets can avoid the large information loss that is a common problem of many traditional threshold algorithms. It cuts down the scale of calculation by shrinking the document vectors while preserving the accuracy of classification.
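
To make the idea of rough set reduction concrete, the following is a minimal sketch and not the algorithm used in the thesis, whose details the abstract does not give. It assumes a toy decision table in which each document is a row of binary term-presence attributes and the class label is the decision attribute. The function names (`partition`, `dependency`, `greedy_reduct`) and the sample terms are hypothetical. The sketch computes the standard rough set dependency degree (the share of documents in the positive region) and then selects terms greedily, in the style of a QuickReduct forward search, until the chosen subset classifies as consistently as the full term set.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, labels, attrs):
    """Dependency degree: |positive region| / |universe|.

    A block belongs to the positive region when all of its
    documents share one class label.
    """
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def greedy_reduct(rows, labels, attrs):
    """Greedy forward selection of attributes (QuickReduct-style).

    Adds the attribute that raises the dependency degree most until the
    subset matches the dependency of the full attribute set. The result
    is a small sufficient subset, not guaranteed to be a minimal reduct.
    """
    target = dependency(rows, labels, attrs)
    chosen = []
    while dependency(rows, labels, chosen) < target:
        best = max(
            (a for a in attrs if a not in chosen),
            key=lambda a: dependency(rows, labels, chosen + [a]),
        )
        chosen.append(best)
    return chosen


# Hypothetical toy decision table: 1 = term occurs in the document.
attrs = ["the", "goal", "stock", "match"]
rows = [
    {"the": 1, "goal": 1, "stock": 0, "match": 1},
    {"the": 1, "goal": 1, "stock": 0, "match": 0},
    {"the": 1, "goal": 0, "stock": 1, "match": 0},
    {"the": 1, "goal": 0, "stock": 1, "match": 1},
]
labels = ["sports", "sports", "finance", "finance"]

print(dependency(rows, labels, ["the"]))   # 0.0
print(dependency(rows, labels, ["goal"]))  # 1.0
print(greedy_reduct(rows, labels, attrs))  # ['goal']
```

In this toy table the term "the" occurs in every document yet has a dependency degree of 0, while "goal" alone determines the class. This mirrors the abstract's claim that a rough set criterion can suppress frequent but uninformative terms and shrink the document vector without losing the information needed for classification.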
Keywords/Search Tags: Automatic Text Classification, Rough Set Theory, Decision Table, Reduce Algorithm, Information Retrieval