A Rough Set Approach To Chinese Documents Classification And Retrieval

Posted on:2005-06-29

Degree:Master

Type:Thesis

Country:China

Candidate:X W Sheng

Full Text:PDF

GTID:2155360152967857

Subject:Linguistics and Applied Linguistics

Abstract/Summary:

As we stride into the information age, we are confronted with vast amounts of digital data on the internet, most of which is text. A good automatic classification and retrieval system for text data is therefore required in order to fetch useful information, and automatic text classification (TC) and retrieval has become a hot topic in natural language processing.

Most term weighting algorithms used in TC compute only term frequency, inverse document frequency and similar statistics. They ignore how a term is distributed among the different classes and how that distribution relates to the classification decision, which inevitably limits the performance of many recent TC systems.

Moreover, automatic text classification is inseparable from the construction of document vectors. Because each term corresponds to one component of the vector, this representation maps documents into a very high dimensional space, possibly on the order of ten thousand dimensions, which results in a massive amount of calculation. An effective reduction algorithm for document vectors is therefore indispensable, yet traditional algorithms based on frequency and threshold filtering often lose much useful information.

By introducing a mathematical tool, Rough Set theory, this thesis presents a new classification system. Rough Set theory can be applied efficiently to both term weighting and dimensionality reduction. In addition, a basic retrieval function can also be realized with it.

The experimental results show that weighting with Rough Set theory can effectively separate terms according to their importance by calculating a disperse weight, and avoids the side effect of the frequency factor. Rough Set weighting is effective for classification because it suppresses unimportant high-frequency terms while raising the weight of important low-frequency terms. Reduction with Rough Set theory avoids the large information loss that is a persistent problem of many traditional threshold algorithms. It cuts down the scale of calculation by shrinking the document vectors while preserving classification accuracy.
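
The abstract does not give the thesis's actual algorithms, so the following is only a minimal sketch of the standard Rough Set notions that a reduction step of this kind usually builds on: a decision table, the dependency degree (size of the positive region divided by the size of the universe), and a greedy search for a reduct. The toy documents, the term names, and the function names `partition`, `dependency` and `greedy_reduct` are assumptions made for illustration, not taken from the thesis.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency(rows, labels, attrs):
    """Dependency degree: fraction of rows whose equivalence class
    (with respect to attrs) contains a single decision label."""
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)

def greedy_reduct(rows, labels):
    """Add attributes one at a time, always the one that raises the
    dependency degree most, until it matches that of the full attribute set.
    This is a common heuristic; it does not guarantee a minimal reduct."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    chosen = []
    while dependency(rows, labels, chosen) < target:
        best = max(
            (a for a in all_attrs if a not in chosen),
            key=lambda a: dependency(rows, labels, chosen + [a]),
        )
        chosen.append(best)
    return chosen

# Hypothetical decision table: each row is a document, each column records
# whether a term occurs in it (1) or not (0); labels are the class decisions.
terms = ["market", "stock", "match", "the"]
rows = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
labels = ["finance", "finance", "finance", "sports", "sports", "sports"]

reduct = greedy_reduct(rows, labels)
print([terms[a] for a in reduct])
```

In this toy table the frequent term "the" occurs in five of the six documents yet determines the class of only one of them, which illustrates in a small way why a frequency-based filter and a decision-based reduction can rank terms differently.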

Keywords/Search Tags:

Automatic Text Classification, Rough Set Theory, Decision Table, Reduce Algorithm, Information Retrieval.

Related items

1	Improvement Of Rough Set Method For Cognitive Diagnosis
2	Application Of Rough Set Theory In Cognitive Diagnosis
3	Research On Automatic Transcription Algorithm Of Piano Music Based On CNN-HMM
4	Research On Music Information Retrieval Algorithm Based On Deep Learning
5	Rough K-means Clustering Algorithm And Its Application In Cultural Relics Health Evaluation
6	Research And Implementation Of Music Retrieval System Based On Humming
7	A Research On Automatic Classification Of Mongolian Text
8	Hesitant Fuzzy Decision-making And Game Methods And Applications Based On The Psychological Behavior Of Decision Makers
9	Research On Information Classification And Algorithmic Annotation Under The Background Of Content Industry Outbreak
10	The Research And Implementation Of The Tibetan Textual Automatic Classification Based On The Web