Applications Of Information Gain Based Bayes Data Mining Algorithm In Spam Filtering

Posted on:2013-01-13

Degree:Master

Type:Thesis

Country:China

Candidate:Q Zhang

GTID:2218330371461718

Subject:Computer technology

Abstract/Summary:

E-mail has become an important communication tool in both daily life and business. As its use has grown, however, users increasingly suffer from spam, which seriously disrupts normal communication. As the internet has developed, the content of spam has also changed in new ways. Older filtering systems fail to recognize much of this new spam, while overly strict filters may misjudge normal mail as spam. E-mail filtering therefore still has considerable room for improvement. The current bottleneck in spam filtering is not raising the interception rate, but reducing the false positive rate while keeping the interception rate high.
We find that the normal mail retained by users carries features that are important for a mail filtering system. This thesis therefore focuses on improving the efficiency of knowledge acquisition in the Bayesian model by mining features from spam and from normal mail separately, so as to improve the effectiveness of the Bayesian classification algorithm. Combining this with a Markov chain approach, we propose a content-based spam filtering method. Moreover, the number of features affects the classification performance of the filter, and selecting a fixed number of features does not necessarily give the best result. We therefore propose a feature selection method based on information gain and use it to improve the Bayesian model: the information gain calculation is used to determine the most appropriate number of features, so as to optimize spam filtering performance.
The experiments were conducted on the TREC 2006 Chinese corpus. The results show that the method clearly distinguishes spam from normal mail and filters Chinese spam effectively.
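
To make the feature selection idea concrete, the following is a minimal sketch of scoring terms by information gain, assuming each mail is represented as a set of tokens and each term is treated as a binary present/absent feature. The function names and the `min_gain` cut-off are illustrative assumptions; the abstract does not state how the thesis derives the optimal number of features, so the threshold rule here only stands in for that step.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG(term) = H(class) - H(class | term present or absent).

    docs is a list of token sets, labels the matching class labels.
    """
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Weighted entropy of the two partitions induced by the term.
    conditional = sum(len(part) / n * entropy(part) for part in (present, absent))
    return entropy(labels) - conditional


def select_features(docs, labels, min_gain):
    """Keep terms whose gain reaches min_gain (hypothetical cut-off rule)."""
    vocab = set().union(*docs)
    scored = [(information_gain(docs, labels, t), t) for t in vocab]
    return sorted(t for g, t in scored if g >= min_gain)
```

With a gain threshold of this kind, the number of retained features follows from the data instead of being fixed in advance: a term that appears equally often in spam and normal mail scores zero and is dropped, while a term that appears only in one class scores highest. The selected terms could then serve as the vocabulary of a Bayesian classifier.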

Keywords/Search Tags:

Spam mail, Bayesian classification, Information gain

Related items

1	An Intelligent And Integrated Method Of Spam Filtering With Double Engines
2	The Analysis And Implementation Of Spam-Filtering System Based On Bayesian Algorithm
3	Towards improving e-mail content classification for spam control: Architecture, abstraction, and strategies
4	Intranet Spam Monitoring Software Design
5	Email Security, Filtering And Inspection Techniques Studied
6	Research On Chinese Spam Filtering Technology
7	Research On E-mail Filter Based On Genetic Algorithm And Naive Bayes Classification
8	Text Classification Based On Bayesian Filtering Technology Research And Realization
9	Research And Implement On The Chinese Anti-Spam Engine Based On The Automatic Category
10	Research And Improvement Of Chinese Spam Emails Filtering Method Based On Bayesian Classification