
Content-based Anti-Spam Filtering

Posted on: 2009-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: D Li
Full Text: PDF
GTID: 2178360245971658
Subject: Computer software and theory
Abstract/Summary:
With advances in network and communication technology, electronic mail (e-mail) is becoming the most important means of communication in modern life. Spam, however, is an inconvenience to users and a serious threat to network security, so solving the spam problem is urgent. Content-based filtering built on machine learning methods has been introduced into current spam filters. However, state-of-the-art classification methods often discard the large number of unlabeled training samples they encounter, which brings heavy time overhead and lowers classification accuracy. This dissertation therefore studies anti-spam filtering that makes use of these unlabeled training samples. The major contributions are as follows:

(1) A study of suitable feature reduction methods for e-mail. Performance degrades when the e-mail attribute vectors have too many dimensions, so dimensionality reduction is necessary. Several feature reduction methods commonly used in text categorization are evaluated separately. According to the results, the χ² statistic and expected cross entropy are the most useful methods for reducing dimensions, while information gain and mutual information are less effective.

(2) A naive Bayes method based on active learning (RANB), which can improve the quality of the training samples, is proposed. The RANB method uses conditional entropy and classification loss as strategies to effectively limit the error introduced by unlabeled samples. The experimental study shows that, compared with classical methods, RANB needs fewer samples to learn while maintaining classification capability.

(3) A spam filter system called ALNBSpamFilter, based on RANB, is designed and built. It is used as a preprocessing step to assign class labels to the unlabeled training samples. The results show that the system using the RANB method can effectively improve the quality of the training samples. Meanwhile, its stability indicates that ALNBSpamFilter has good prospects for practical application.
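As an illustration of the chi-square (χ²) feature reduction mentioned in contribution (1), the following is a minimal sketch of the standard χ² term-scoring formula used in text categorization. It is not the dissertation's implementation: the function names chi_square_scores and select_top_terms, the "spam" label string, and the assumption that each e-mail is already tokenized into a list of words are all hypothetical choices made for this example.

```python
from collections import Counter


def chi_square_scores(docs, labels, target="spam"):
    """Score each term by its chi-square statistic with respect to `target`.

    docs   : list of token lists, one per e-mail (assumed pre-tokenized)
    labels : list of class labels, same length as docs
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)

    in_target = Counter()  # A: target-class e-mails containing the term
    in_other = Counter()   # B: other-class e-mails containing the term
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # count document frequency, not raw frequency
            if y == target:
                in_target[term] += 1
            else:
                in_other[term] += 1

    scores = {}
    for term in set(in_target) | set(in_other):
        a = in_target[term]
        b = in_other[term]
        c = n_target - a          # target-class e-mails without the term
        d = (n - n_target) - b    # other-class e-mails without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # chi2(t, c) = N * (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_top_terms(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms that occur almost exclusively in one class receive high scores, while terms spread evenly across classes score near zero, so keeping only the top k terms shrinks the attribute vector before a classifier such as naive Bayes is trained.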
Keywords/Search Tags: spam, machine learning, text categorization, naive Bayes classification, active learning