Text Classification Using Sentential Frequent Itemsets

Posted on: 2007-12-06
Degree: Master
Type: Thesis
Country: China
Candidate: S Z Liu
GTID: 2178360242461899
Subject: Computer software and theory
Abstract/Summary:
Text classification (TC), also known as text categorization or topic spotting, is the activity of labeling natural language texts with thematic categories from a predefined set; it became a major subfield of the information systems discipline in the early 1990s. Most text classification techniques rely on single-term analysis of the document data set, although many concepts, especially specific ones, are conveyed by sets of terms. To build a more accurate text classifier, more informative features are therefore important, in particular words that frequently co-occur in the same sentence, together with their weights. In this paper, we propose a novel approach to text classification based on the sentential frequent itemset (SFI), a concept derived from association rule mining. The approach treats a sentence, rather than a document, as a transaction and uses a method based on the variable precision rough set model to evaluate each SFI's contribution to the classification. Merging the SFIs of documents that belong to the same category directly yields the features of that category. Because the number of SFIs associated with a category can be very large, the key problem is how to calculate each SFI's global weight, that is, its contribution to the classification. We propose a weighting scheme based on the variable precision rough set model to evaluate each SFI's global weight, and use these weights to select the SFIs for each topic template. Experiments on the Reuters corpus validate the practicability of the proposed system.
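
The idea of treating each sentence as a transaction can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `sentential_frequent_itemsets`, the parameters `min_support` and `max_size`, the regex-based sentence splitting, and the brute-force enumeration (used here in place of an Apriori-style miner) are all assumptions made for illustration. The variable precision rough set weighting is not shown, since the abstract does not specify it.

```python
import re
from collections import Counter
from itertools import combinations


def sentential_frequent_itemsets(document, min_support=2, max_size=2):
    """Map each frequent itemset (frozenset of words) to its sentence count.

    Hypothetical helper: each sentence is one transaction, and an itemset is
    frequent if it occurs in at least `min_support` sentences.
    """
    # Naive sentence splitting on ., ! and ? (an assumption, not from the thesis).
    sentences = [s for s in re.split(r"[.!?]+", document) if s.strip()]
    # One transaction per sentence: the set of distinct lowercase words.
    transactions = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]

    counts = Counter()
    for words in transactions:
        # Brute-force enumeration of all itemsets up to max_size words.
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(words), size):
                counts[frozenset(combo)] += 1

    return {itemset: n for itemset, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    doc = "Oil prices rose. Crude oil prices fell. The bank raised rates."
    for itemset, support in sentential_frequent_itemsets(doc).items():
        print(sorted(itemset), support)
```

Counting support over sentences rather than documents means that words are only grouped when they co-occur closely, which is the motivation the abstract gives for SFIs. A practical version would likely add stop-word removal and stemming before mining.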
Keywords/Search Tags: Text Classification, Data Mining, Document Database, Variable Precision Rough Set