Using background knowledge to improve text classification
Posted on: 2003-09-30 | Degree: Ph.D | Type: Dissertation | University: Rutgers The State University of New Jersey - New Brunswick | Candidate: Zelikovitz, Sarah | GTID: 1468390011979132 | Subject: Computer Science

Abstract/Summary:

Automatic text categorizers use a corpus of labeled textual strings or documents to assign the correct label to previously unseen strings or documents. Often the given set of labeled examples, or “training set”, is insufficient to solve this problem. Our approach to this problem has been to incorporate readily available information into the learning process to allow for the creation of more accurate classifiers. We term this additional information “background knowledge.”

We provide a framework for the incorporation of background knowledge into three distinct text classification learners. In the first approach, we show that background knowledge can be used as a set of unlabeled examples in a generative model for text classification. Using the methodology of other researchers who treat the classes of unlabeled examples as missing values, we show that although this background knowledge may be of a different form and type than the training and test sets, it can still be quite useful. Second, we view the text classification task as one of information integration using WHIRL, a tool that combines database functionalities with techniques from the information-retrieval literature. We treat the labeled data, test set, and background knowledge as three separate databases and use the background knowledge as a bridge to connect elements from the training set to the test set. In this way, training examples are related to a test example in the context of the background knowledge. Lastly, we use Latent Semantic Indexing in conjunction with background knowledge. In this case, background knowledge is used with the labeled examples to create a new space in which the training and test examples are redescribed. This allows the system to incorporate information from the background knowledge in the similarity comparisons between training and test examples.

Keywords/Search Tags: Background knowledge, Text classification, Examples, Training, Labeled, Information
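
The "bridge" idea from the second approach can be illustrated with a small sketch. This is not WHIRL and not the dissertation's implementation: it is a simplified stand-in that assumes plain TF-IDF cosine similarity, and the function names (`tfidf_vectors`, `cosine`, `classify_via_bridge`) and the toy documents are hypothetical. A test document that shares no words with any training example can still be labeled when a background document overlaps with both.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch), always >= 1.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def classify_via_bridge(train, background, test_doc):
    """Score each label by paths test -> background document -> training example.

    train is a list of (text, label) pairs; background is a list of unlabeled
    texts. Returns the best-scoring label, or None if no background document
    overlaps with the test document.
    """
    # IDF weights are computed over all three collections together
    # (a simplifying choice made for this sketch).
    texts = [text for text, _ in train] + list(background) + [test_doc]
    vecs = tfidf_vectors(texts)
    train_vecs = vecs[:len(train)]
    bg_vecs = vecs[len(train):len(train) + len(background)]
    test_vec = vecs[-1]

    scores = defaultdict(float)
    for bg_vec in bg_vecs:
        sim_test = cosine(test_vec, bg_vec)
        if sim_test == 0.0:
            continue  # this background document is not a bridge for the test document
        for (_, label), train_vec in zip(train, train_vecs):
            scores[label] += sim_test * cosine(bg_vec, train_vec)
    return max(scores, key=scores.get) if scores else None


# Toy data, invented for illustration only.
train = [
    ("stock market prices", "finance"),
    ("football match score", "sports"),
]
background = [
    "the stock exchange lists shares and bonds traded by investors",
    "the football league season ended with a goalkeeper injury",
]

print(classify_via_bridge(train, background, "investors bought bonds"))
```

In this toy run the test document has no terms in common with either training example; the label comes entirely from the background document that mentions both "investors" and "stock". Multiplying the two similarities is one simple way to combine them and is a choice made for this sketch.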