
Text Classification Based On Wikipedia Knowledge

Posted on: 2014-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: C Su
GTID: 2268330401973731
Subject: Computer application technology
Abstract/Summary:
With the rapid development of internet technology, the amount of online information keeps growing. How to organize and manage this information and mine valuable knowledge from it has drawn the attention of many researchers. Traditional classifiers are built on large numbers of labeled samples, and because labeling is done by hand and is costly, researchers have proposed semi-supervised learning methods to reduce the labeling cost and improve performance. Since most real-world data are unlabeled, how to mine valuable information from large numbers of unlabeled samples and build a high-performance classifier is a focus of research.

This paper studies how to use the provided keywords and Wikipedia information effectively to assist labeling, in order to construct a good-quality classifier using only Wikipedia background knowledge and unlabeled training samples. The main topics of the paper are as follows:

(1) Extract the related wiki documents. For the case where no positive or negative samples are available, this study extracts the Wikipedia knowledge related to the provided keywords, based on the JWPL package and the extraction approach proposed in this paper.

(2) Label the unlabeled documents with the related documents. Using the related Wikipedia documents obtained, this study labels the unlabeled documents. Taking feature selection as the entry point, the labeling process is completed in two stages, initial labeling and extended labeling. In this procedure, new labeling strategies built on a number of classical algorithms are proposed to provide the initial labeled training samples for constructing a classifier.

(3) Use the SVM algorithm to build a text classifier. After obtaining the positive and negative samples, this research constructs SVM classifiers iteratively and finally chooses the best one as the final classifier.

The classifier is tested on experimental data to evaluate its performance.
The results show that the related wiki documents extracted from the Wikipedia database are highly similar to the positive samples hidden among the unlabeled samples, and that the text classifier built from the keywords, the Wikipedia database, and the unlabeled samples performs well and stably. We therefore conclude that text classification based on Wikipedia makes text classifiers more convenient to build and has practical value.
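
As a rough illustration of the labeling idea in the abstract (using related Wikipedia documents to assign initial labels to unlabeled documents), the following Python sketch scores each unlabeled document by its cosine similarity to the centroid of the related wiki documents. The abstract does not specify the similarity measure, feature weighting, or thresholds, so TF-IDF weighting, the centroid comparison, the threshold values, and all function names here are assumptions made for illustration only. They are not the thesis's actual method.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF, an assumed weighting scheme.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def initial_labels(wiki_docs, unlabeled_docs, pos_threshold=0.3, neg_threshold=0.05):
    """Assign +1, -1, or 0 (undecided) to each unlabeled document.

    The thresholds are hypothetical values chosen for this sketch.
    """
    vectors = tfidf_vectors(wiki_docs + unlabeled_docs)
    wiki_vecs = vectors[:len(wiki_docs)]
    # Centroid (mean vector) of the related wiki documents.
    centroid = {}
    for vec in wiki_vecs:
        for term, weight in vec.items():
            centroid[term] = centroid.get(term, 0.0) + weight / len(wiki_vecs)
    labels = []
    for vec in vectors[len(wiki_docs):]:
        score = cosine(vec, centroid)
        if score >= pos_threshold:
            labels.append(1)    # likely positive sample
        elif score <= neg_threshold:
            labels.append(-1)   # likely negative sample
        else:
            labels.append(0)    # left unlabeled for a later extension step
    return labels
```

Documents labeled 0 would stay unlabeled after this first pass. In a pipeline like the one the abstract describes, the +1 and -1 samples could then seed an iteratively retrained classifier such as an SVM.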
Keywords/Search Tags: text classification, Wikipedia, classifier, labeling sample