Research On Semi-supervised Text Categorization Method Based On EM Algorithm

Posted on:2011-05-14

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y Guo

Full Text:PDF

GTID:2178360308454510

Subject:Computer application technology

Abstract/Summary:

With the growth of the Internet and the emergence of large volumes of text, automatic text categorization has become a research hotspot. Improving the performance of a text classifier usually requires many labeled training samples. However, collecting them is difficult because labels are assigned manually by experienced analysts, which is time-consuming and labor-intensive. In contrast, unlabeled samples are easy to collect. Semi-supervised learning eases this labeling bottleneck by combining a large number of unlabeled samples with a small number of labeled samples. As a result, semi-supervised text categorization has attracted growing attention from researchers.

EM-based semi-supervised text classification uses unlabeled samples to improve classification performance by constructing a classifier from all labeled and unlabeled samples. Because the initial labeled training set is small, the initial classifier performs poorly, and some unlabeled samples are easily misclassified by it. These misclassified samples disrupt the learning process and reduce classification performance. To address this problem, an improved EM-based semi-supervised text classification method based on data reconstruction is proposed. Following the cluster assumption of semi-supervised learning, the method uses the neighborhood relations between labeled and unlabeled samples to reconstruct the training set by means of data editing and ensemble learning. Experiments show that classification performance improves.

In addition, once unreliable information is added, it harms the classifier and prevents the useful information in the unlabeled training samples from being fully exploited. To solve this problem, a semi-supervised text classification method based on an incremental EM algorithm is put forward. The method makes full use of the information in the intermediate classifier and adds unlabeled samples to the labeled set incrementally. It uses a division mechanism to partition the unlabeled samples and a feedback learning mechanism to correct the information of the incremental samples. This improves the reliability of the incremental samples and, in turn, the classification performance. The experimental results show that the method is feasible.
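
For orientation, the baseline that the abstract builds on (EM combined with a naive Bayes text classifier, per the keywords) can be sketched as follows. This is a minimal illustration of the generic EM loop only, not the thesis's improved methods: the data editing, ensemble learning, division, and feedback mechanisms are not specified in the abstract and are not reproduced here. The function names (`fit_nb`, `posterior`, `em_nb`), the smoothing parameter `alpha`, and the toy count matrices are all assumptions made for the example.

```python
import numpy as np

def fit_nb(X, R, alpha=1.0):
    """M-step: multinomial naive Bayes parameters from (soft) class memberships.
    X: (n_docs, n_words) word counts; R: (n_docs, n_classes) memberships."""
    class_mass = R.sum(axis=0)  # expected number of documents per class
    log_prior = np.log((class_mass + alpha) / (class_mass.sum() + alpha * R.shape[1]))
    word_counts = R.T @ X + alpha  # expected word counts per class, Laplace-smoothed
    log_lik = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_lik

def posterior(X, log_prior, log_lik):
    """E-step: P(class | document) under the current parameters."""
    joint = X @ log_lik.T + log_prior
    joint = joint - joint.max(axis=1, keepdims=True)  # stabilize the exponent
    p = np.exp(joint)
    return p / p.sum(axis=1, keepdims=True)

def em_nb(X_lab, y_lab, X_unl, n_classes, n_iter=20, tol=1e-6):
    """Train on labeled data, then alternate E- and M-steps over all data.
    Labeled documents keep their hard labels throughout."""
    R_lab = np.eye(n_classes)[y_lab]      # one-hot memberships for labeled docs
    params = fit_nb(X_lab, R_lab)         # initial classifier from labeled data only
    X_all = np.vstack([X_lab, X_unl])
    R_unl = posterior(X_unl, *params)     # first soft labeling of unlabeled docs
    for _ in range(n_iter):
        params = fit_nb(X_all, np.vstack([R_lab, R_unl]))  # M-step
        R_new = posterior(X_unl, *params)                  # E-step
        converged = np.abs(R_new - R_unl).max() < tol
        R_unl = R_new
        if converged:
            break
    return params, R_unl

# Toy data (hypothetical): words 0-1 lean toward class 0, words 2-3 toward class 1.
X_lab = np.array([[3, 1, 0, 0], [0, 0, 2, 2]])
y_lab = np.array([0, 1])
X_unl = np.array([[2, 2, 0, 0], [0, 1, 3, 1], [4, 0, 0, 1], [0, 0, 1, 3]])

params, R_unl = em_nb(X_lab, y_lab, X_unl, n_classes=2)
pred_unl = R_unl.argmax(axis=1)
```

In this sketch every unlabeled document contributes to the M-step in proportion to its posterior, so an early misclassification feeds back into the parameters. That is the failure mode the abstract describes and that its two proposed methods aim to limit.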

Keywords/Search Tags:

Text categorization, Semi-supervised learning, EM algorithm, Naive Bayesian

Related items

1	Research On Short Text Categorization Based On Semi-Supervised Learning
2	Semi-supervised Text Categorization Technology Research Based On The Semantic Analysis
3	Design And Realization Of Text Categorization System
4	A Study On Chinese Text Categorization
5	A Study On Text Categorization Based On Machine Learning
6	Research On Technology Of Intrusion Detection Based On Improved Naive Bayesian Algorithm
7	Research On Short Text Categorization Based On Phrase-Like Repeat And Semi-Supervised Learning
8	Semi-supervised Learning On Text Data
9	Research On Semi-Supervised Learning Algorithms Based On Bayesian Method
10	Semi-supervised Learning Based On Information Theory And Functional Dependency Rules Of Probability