Applications Of Information Gain Based Bayes Data Mining Algorithm In Spam Filtering

Posted on: 2013-01-13
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhang
GTID: 2218330371461718
Subject: Computer technology
Abstract/Summary:
E-mail has become an important communication tool in both daily life and business. As its use has grown, however, people increasingly suffer from spam, which seriously disrupts normal communication. With the rapid development of the internet, the content of spam has also changed in many ways. Older filtering systems do not recognize this new spam reliably enough to stop it, while an over-strict filtering system may misjudge normal mail as spam. E-mail filtering systems therefore still have much room for improvement. The current bottleneck in spam filtering technology is not raising the interception rate but reducing the false positive rate of the filter while maintaining a high interception rate.
We find that the normal mail retained by users contains features that are important to a mail filtering system. This thesis therefore focuses on improving the efficiency of knowledge acquisition in the Bayesian model by mining features from spam and from normal mail separately, so as to improve the effectiveness of the Bayesian classification algorithm. Combining this with a Markov chain approach, we propose a content-based spam filtering method. Moreover, the number of features affects the classification performance of the filter, and selecting a fixed number of features does not necessarily give the best result. We therefore propose a feature selection method based on information gain and use it to improve the Bayesian model. The optimal number of features is determined through information gain calculation, so that the most appropriate number of features is used and the spam filtering performance is optimized.
The experiments in this thesis use the TREC 2006 Chinese corpus. The results show that the method clearly distinguishes spam from normal mail and filters Chinese spam effectively.
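The information-gain feature selection mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes each message is reduced to a set of tokens (binary term presence), and the function names and the `min_gain` threshold are hypothetical choices made only for illustration. The abstract does not say how the optimal number of features is derived from the gain values, so a simple cutoff stands in for that step.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Information gain of a term's presence with respect to the class label.

    docs   : list of token sets, one per message (assumed representation)
    labels : list of class labels, e.g. "spam" / "ham"
    """
    n = len(docs)
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n_with = sum(with_term.values())
    n_without = n - n_with

    prior = entropy(list(Counter(labels).values()))
    conditional = (
        (n_with / n) * entropy(list(with_term.values()))
        + (n_without / n) * entropy(list(without_term.values()))
    )
    return prior - conditional


def select_features(docs, labels, min_gain):
    """Keep every term whose gain reaches min_gain (hypothetical cutoff rule).

    The number of selected features is then decided by the gain values
    rather than fixed in advance.
    """
    vocab = set().union(*docs)
    return sorted(t for t in vocab if information_gain(docs, labels, t) >= min_gain)
```

Called on a small labelled sample, `select_features(docs, labels, min_gain=0.5)` returns only the terms whose presence separates the two classes well; the selected terms would then serve as the features of the Bayesian classifier.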
Keywords/Search Tags: Spam mail, Bayesian classification, Information gain