Content-based Anti-Spam Filtering

Posted on:2009-02-08

Degree:Master

Type:Thesis

Country:China

Candidate:D Li

Full Text:PDF

GTID:2178360245971658

Subject:Computer software and theory

Abstract/Summary:

Electronic mail (e-mail) is becoming the most important means of communication among modern people as network and communication technologies advance. However, spam causes inconvenience in daily life and seriously harms network security, so solving the spam problem is urgent. Content-based methods built on machine learning have been introduced into current spam filters. However, state-of-the-art classification methods often discard unlabeled training samples when they are present in large numbers, because handling them adds heavy time overhead and lowers classification accuracy. This dissertation therefore studies anti-spam filtering that makes use of these unlabeled training samples. The major contributions are as follows:

(1) A study of feature reduction methods suited to e-mail. Performance degrades when the e-mail attribute vectors have too many dimensions, so dimensionality reduction is necessary. Several feature reduction methods commonly used in text categorization are evaluated separately. According to the results, the χ² statistic and expected cross entropy are the most useful for reducing dimensions; information gain and mutual information are less effective.

(2) A naive Bayes method based on active learning (RANB), which improves the quality of the training samples. RANB uses conditional entropy and classification loss strategies to effectively limit the error introduced by unlabeled samples. Experiments show that, compared with classical methods, RANB needs fewer samples to learn while maintaining classification capability.

(3) A spam filter system based on RANB, called ALNBSpamFilter, which is used as a preprocessing step to label the unlabeled training samples. The results show that the system using the RANB method effectively improves the quality of the training samples. Its stability also indicates that ALNBSpamFilter has good application prospects.
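
To make the χ² feature reduction named in contribution (1) concrete, the following is a minimal sketch and not the thesis's implementation. It scores each term with the standard 2x2 contingency-table form of the χ² statistic and keeps the highest-scoring terms. The function names (chi_square_scores, top_k_terms), the token-list input format, and the toy messages are all assumptions made for illustration.

```python
from collections import Counter


def chi_square_scores(docs, labels, target="spam"):
    """Return {term: chi-square score} for each term against the target class.

    docs is a list of token lists and labels the matching class names.
    Both the input format and the function name are illustrative assumptions.
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    in_target = Counter()  # A: target-class documents containing the term
    in_other = Counter()   # B: other documents containing the term
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # count document presence, not frequency
            (in_target if y == target else in_other)[term] += 1

    scores = {}
    for term in set(in_target) | set(in_other):
        a = in_target[term]
        b = in_other[term]
        c = n_target - a          # target-class documents without the term
        d = (n - n_target) - b    # other documents without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # A term present in every document (or none) carries no signal.
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def top_k_terms(scores, k):
    """Keep the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data, invented for illustration only.
docs = [
    ["free", "prize", "click"],
    ["free", "offer", "click"],
    ["meeting", "agenda", "click"],
    ["meeting", "notes", "click", "free"],
]
labels = ["spam", "spam", "ham", "ham"]

scores = chi_square_scores(docs, labels)
print(top_k_terms(scores, 3))
```

In this toy set, "click" appears in every message and therefore scores 0, while "meeting" separates the two classes perfectly and scores highest. A real filter would apply the same ranking to a large labeled corpus and keep only the top-ranked terms as the reduced feature set.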

Keywords/Search Tags:

spam, machine learning, text categorization, naive Bayes classification, active learning

Related items

1	Application Of Improved Naive Bayes Algorithm In Spam Filtering
2	A Study On Text Categorization Based On Machine Learning
3	Incremental Learning Of Naive Bayes Chinese Classification System
4	The Research And Application Of Text Categorization Arithmetic In Spam Filtering
5	Research On Spam Text Classification Based On Improved Naive Bayes Algorithm
6	Research On Text Categorization Method By Active Multi-Field Learning For Spam Filtering
7	Text Categorization Based On Naive Bayes Method
8	Design And Implementation Of The Email Spam Detection System Based On Naive Bayes And SVM
9	Chinese WEB Document Automatic Categorization