
Research On Semi-supervised Text Categorization Method Based On EM Algorithm

Posted on: 2011-05-14
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Guo
GTID: 2178360308454510
Subject: Computer application technology
Abstract/Summary:
With the growth of the Internet and the resulting flood of text, automatic text categorization has become a research hotspot. Improving a text classifier usually requires many labeled training samples. Such samples are hard to collect, however, because labels are assigned manually by experienced analysts, which is time-consuming and labor-intensive. Unlabeled samples, in contrast, are easy to collect. Semi-supervised learning eases this labeling bottleneck by combining a large number of unlabeled samples with a small number of labeled ones, and semi-supervised text categorization has therefore attracted growing attention from researchers.

EM-based semi-supervised text classification uses unlabeled samples to improve classification performance by building the classifier from all labeled and unlabeled samples. Because only a few labeled training samples are available at the start, the initial classifier performs poorly and easily misclassifies some unlabeled samples. These misclassified samples disrupt the learning process and reduce classification performance to some extent. To address this problem, an improved EM-based semi-supervised text classification method based on data reconstruction is proposed. Following the cluster assumption of semi-supervised learning, the method uses the neighborhood relations between labeled and unlabeled samples to reconstruct the training set by means of data editing and ensemble learning. Experiments show that classification performance improves.

In addition, once unreliable information has been added to the training set, it harms the classifier and prevents the useful information in the unlabeled training samples from being fully exploited. To address this second problem, a semi-supervised text classification method based on an incremental EM algorithm is put forward. The method makes full use of the information in the intermediate classifier and adds unlabeled samples to the labeled set incrementally. A division mechanism partitions the unlabeled samples, and a feedback learning mechanism corrects the information of the incrementally added samples. This improves the reliability of the incremental samples and, in turn, the classification performance. The experimental results show that the method is feasible.
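The general idea of EM-based semi-supervised classification with a Naive Bayes model can be illustrated with a minimal sketch. It is a generic textbook-style formulation, not the thesis's implementation: the function names, the Laplace smoothing constant `alpha`, the toy data, and the `threshold` parameter are all assumptions made for illustration. In particular, the confidence threshold is only a simple stand-in for adding unlabeled samples selectively; the abstract does not specify the division or feedback learning mechanisms, so they are not modeled here.

```python
import math
from collections import Counter

def train_nb(docs, weights, n_classes, vocab, alpha=1.0):
    """M-step: fit multinomial Naive Bayes from (possibly soft) class weights."""
    class_mass = [alpha] * n_classes              # smoothed class totals
    word_counts = [Counter() for _ in range(n_classes)]
    for doc, w in zip(docs, weights):
        for c in range(n_classes):
            class_mass[c] += w[c]
            for tok in doc:
                word_counts[c][tok] += w[c]       # fractional counts allowed
    total = sum(class_mass)
    log_prior = [math.log(m / total) for m in class_mass]
    log_like = []
    for c in range(n_classes):
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like.append(
            {t: math.log((word_counts[c][t] + alpha) / denom) for t in vocab}
        )
    return log_prior, log_like

def posterior(doc, model):
    """E-step for one document: P(class | doc) under the current model."""
    log_prior, log_like = model
    scores = [
        lp + sum(ll[t] for t in doc if t in ll)   # unseen tokens are skipped
        for lp, ll in zip(log_prior, log_like)
    ]
    top = max(scores)                             # stabilise the softmax
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def semi_supervised_em(labeled, unlabeled, n_classes, n_iter=10, threshold=0.0):
    """Alternate E- and M-steps over labeled and unlabeled documents.

    threshold=0.0 uses every unlabeled document with its soft label;
    a higher value (hypothetical knob) admits only confident documents.
    """
    vocab = {t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d}
    lab_docs = [d for d, _ in labeled]
    lab_w = [[1.0 if c == y else 0.0 for c in range(n_classes)] for _, y in labeled]
    model = train_nb(lab_docs, lab_w, n_classes, vocab)   # labeled data only
    for _ in range(n_iter):
        post = [posterior(d, model) for d in unlabeled]                 # E-step
        keep = [(d, p) for d, p in zip(unlabeled, post) if max(p) >= threshold]
        model = train_nb(                                               # M-step
            lab_docs + [d for d, _ in keep],
            lab_w + [p for _, p in keep],
            n_classes,
            vocab,
        )
    return model

# Toy data (invented): class 0 = sports, class 1 = finance.
labeled = [(["goal", "match", "team"], 0), (["stock", "market", "price"], 1)]
unlabeled = [
    ["team", "league", "goal"],
    ["league", "coach"],
    ["price", "shares", "market"],
    ["shares", "bank"],
]
model = semi_supervised_em(labeled, unlabeled, n_classes=2, threshold=0.55)
print(posterior(["league", "coach"], model))
```

In this toy run, the words "league" and "coach" never occur in the labeled documents, so the initial classifier cannot separate them. They acquire a class association only through unlabeled documents that share words with the labeled ones, which is the effect the abstract attributes to unlabeled data. The same mechanism also shows the risk it describes: a wrong early posterior would be fed back into the next M-step.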
Keywords/Search Tags: Text categorization, Semi-supervised learning, EM algorithm, Naive Bayesian