Semi-supervised Webpage Classification

Posted on: 2014-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: K Zhang
Full Text: PDF
GTID: 2268330392469175
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet, the number of web pages is growing rapidly, and there is an urgent demand to classify this mass of web text and find what we really need in it. Traditional supervised classifiers, however, need large amounts of manually labeled data to train their models. Given the massive volume of information on the web, the cost of labeling is too high to afford; moreover, users usually hold differing views on the same problem. Obtaining the information we need quickly and accurately therefore becomes an increasingly pressing problem. At the same time, with the rapid development of web data collection and mining, more and more data can be used to address the problems caused by massive network information, and these are the basic technologies we need. In the real world, raw data written by human editors is easy to find, but most of it carries no human labels; only a small part is labeled. If we assume that the labeled and unlabeled data are drawn from the same distribution and the same feature space, we can combine them to create a sufficiently large set of labeled training data that is also accurate enough to describe the whole data set. If this assumption holds, our classifier can achieve high approximation precision and good generalization capability.
In this paper, we try several mainstream and important semi-supervised classifiers. The training data comes directly from the real web. To retain as much of the information in the web pages as possible, after extracting text from the raw pages we apply only language selection, advertisement removal, and short-text removal. To improve the final results, several feature selection and feature extraction methods were tried.
For the evaluation, this paper tries three classifiers based on completely different theories: the classic EM algorithm, the TSVM based on transductive learning, and the DBN based on a deep architecture. For feature selection, this paper not only tries the classic methods but also introduces macro features, which improves performance. The ideas proposed in this paper are tested on three data sets using the three semi-supervised classifiers. The experimental results show that semi-supervised classifiers with different features can obtain good results.
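To make the idea of combining a few labeled pages with many unlabeled ones concrete, the following is a minimal sketch of semi-supervised EM. It assumes a multinomial naive Bayes base model and whitespace tokenization; the abstract does not say which base model or tokenizer the thesis pairs with EM, and the function names (`train_nb`, `posterior`, `em_naive_bayes`, `predict`) and the toy documents are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def train_nb(docs, weights, classes, vocab, alpha=1.0):
    """Fit multinomial naive Bayes from soft class weights (one dict per document)."""
    prior = {c: alpha for c in classes}
    counts = {c: Counter() for c in classes}
    for tokens, w in zip(docs, weights):
        for c in classes:
            prior[c] += w[c]
            for t in tokens:
                counts[c][t] += w[c]
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        # Laplace smoothing over the shared vocabulary.
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((counts[c][t] + alpha) / denom) for t in vocab}
    return log_prior, log_like


def posterior(tokens, model, classes):
    """Return P(class | tokens) as a dict; tokens outside the vocabulary are ignored."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in classes
    }
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def em_naive_bayes(labeled, unlabeled, n_iter=10):
    """Train on labeled (text, label) pairs, then refine with unlabeled texts via EM."""
    classes = sorted({y for _, y in labeled})
    l_docs = [text.lower().split() for text, _ in labeled]
    u_docs = [text.lower().split() for text in unlabeled]
    vocab = {t for d in l_docs + u_docs for t in d}
    # Labeled documents keep hard (one-hot) weights throughout.
    l_w = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    model = train_nb(l_docs, l_w, classes, vocab)
    for _ in range(n_iter):
        # E-step: soft-label the unlabeled documents with the current model.
        u_w = [posterior(d, model, classes) for d in u_docs]
        # M-step: refit on labeled plus soft-labeled documents.
        model = train_nb(l_docs + u_docs, l_w + u_w, classes, vocab)
    return model, classes


def predict(text, model, classes):
    post = posterior(text.lower().split(), model, classes)
    return max(post, key=post.get)


# Toy data: two labeled pages and four unlabeled ones.
labeled = [("football match goal", "sports"), ("stock market shares", "finance")]
unlabeled = [
    "goal striker league",
    "shares dividend investor",
    "league striker match",
    "investor market dividend",
]
model, classes = em_naive_bayes(labeled, unlabeled)
print(predict("striker league", model, classes))
print(predict("dividend investor", model, classes))
```

In this toy run the words "striker" and "dividend" never appear in a labeled document; the model can only associate them with a class through their co-occurrence with labeled words in the unlabeled pages, which is the benefit the same-distribution assumption is meant to deliver.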
Keywords/Search Tags: text classification, webpage classification, semi-supervised, feature selection, feature extraction