Text Classification Algorithm

Posted on: 2003-06-12
Degree: Master
Type: Thesis
Country: China
Candidate: A Yang
Full Text: PDF
GTID: 2208360065450735
Subject: Computer application technology
Abstract/Summary:
With the development of the WWW, the number of documents on the internet is increasing exponentially. One important research focus is how to deal with this large volume of online documents. Automatic text classification is a crucial part of information management.

This paper focuses mainly on text classification algorithms. Text classification algorithms are supervised, which means that training the classifier requires human-labeled data from a fixed set of classes. Generally, the more labeled data available, the higher the accuracy of the classifier. But hand-labeled data are an expensive resource. One vital problem in text classification is how to reduce the amount of labeled data while maintaining adequate accuracy. This paper partly solves this problem from two different angles.

First, we deal with sparse training data by selecting a high-performance algorithm. This paper proposes a novel text classification algorithm, the k nearest feature line (kNFL) algorithm, based on the nearest feature line algorithm originally proposed for face recognition. The experiments show that this algorithm can handle sparse training data with rather high accuracy.

Second, a great number of unlabeled documents are available online. This paper proposes a novel algorithm, called iterative TFIDF, which combines a large amount of unlabeled data with a small amount of labeled data to train the TFIDF classifier. Iterative TFIDF reduces the number of labeled documents required. On the same experimental data, the results show that this algorithm achieves higher accuracy than EM Bayes text classification. Iterative TFIDF is a hill-climbing algorithm, so it shares the common problems of converging to a local optimum and being sensitive to the initial point.

To deal with the local optimum problem, we introduce active learning to slow the convergence toward a local optimum. The results show that this combination is helpful: active learning reduces the classification bias and boosts the accuracy.
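
The nearest feature line idea mentioned above can be illustrated with a short Python sketch. The abstract does not give the thesis's exact kNFL formulation, so the details here are assumptions: documents are taken to be already converted to numeric vectors, a feature line is the straight line through two training vectors of the same class, and the k closest feature lines vote by simple majority. The names feature_line_distance and knfl_predict are hypothetical and not taken from the thesis.

```python
import math
from collections import Counter
from itertools import combinations


def feature_line_distance(x, a, b):
    """Distance from point x to the straight line through points a and b."""
    direction = [bi - ai for ai, bi in zip(a, b)]
    denom = sum(d * d for d in direction)
    if denom == 0.0:
        # a and b coincide, so the "line" degenerates to a single point.
        return math.dist(x, a)
    # mu is the position of the projection of x along the line a + mu * (b - a).
    mu = sum((xi - ai) * d for xi, ai, d in zip(x, a, direction)) / denom
    projection = [ai + mu * d for ai, d in zip(a, direction)]
    return math.dist(x, projection)


def knfl_predict(x, train, k=3):
    """Classify x by majority vote among its k nearest feature lines.

    train maps each class label to a list of training vectors.
    Assumption: majority voting over the k closest lines; the thesis
    may combine the lines differently.
    """
    scored = []
    for label, vectors in train.items():
        # Every pair of same-class vectors defines one feature line.
        for a, b in combinations(vectors, 2):
            scored.append((feature_line_distance(x, a, b), label))
    if not scored:
        raise ValueError("each class needs at least two training vectors")
    scored.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

Because every pair of same-class vectors contributes a line, a class with n training documents yields n(n-1)/2 feature lines, which is one plausible reason the approach suits sparse training data. A class with a single training vector contributes no lines in this sketch.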
Keywords/Search Tags: text classification, EM algorithm, TFIDF algorithm, kNFL algorithm