With the development of Internet advertising technology and the popularity of e-mail, spam is becoming an increasingly serious problem. Building on previous theory and research, this paper systematically studied methods of spam classification and mainly analyzed Naive Bayes classification as applied to spam filtering. The paper first introduced spam in terms of its definition, characteristics and harm, and analyzed the current state of spam at home and abroad. It then introduced two kinds of spam filtering methods, based on the spam's source and on its content. Content-based Naive Bayes classification has been widely used in spam filtering research because it is efficient and easy to implement. Next, the paper introduced the key technologies of text classification, including text preprocessing, text feature selection, text representation and text categorization algorithms. Finally, experiments showed that the several improvements to Naive Bayes classification proposed in this paper improved classification performance.

Given the importance of accurate classification and of the authenticity and authority of the data, this paper designed five groups of comparative experiments using data from the Apache SpamAssassin Project. Experiment one built a Bernoulli Naive Bayes classification model on the data without any preprocessing. The large number of dictionary words made the joint probability distribution very expensive to compute, beyond the available computing power. When calculating the probability that a text belongs to a given category, the value easily fell outside the range of floating-point numbers (underflow), so the computed result became zero and classification accuracy decreased. Therefore, this paper optimized the calculation by computing the ratio of the probability that the text is judged to be normal mail to the probability that it is judged to be spam. This method improved the classification accuracy from 88.3% to 92.3%. Although computing the ratio made better use of the floating-point range, the ratio could still become zero or infinity, so the text feature dimension needed to be reduced. In experiment two, stop words were first removed according to the traditional method, and the results showed that accuracy decreased. This indicated that some stop words contribute to text classification, so the paper turned to feature extraction. Experiment three improved the mutual-information method of feature extraction and proposed the concepts of relative dependency, classification ability and comprehensive classification ability. Compared with the mutual-information method, accuracy increased from 87.8% to 96.6% when both methods used about ten thousand feature words. The improved method can extract the feature set with the greatest comprehensive classification ability, but that set does not necessarily have the greatest classification ability for a given mail, so this paper discussed this aspect in depth. Experiment four improved the feature selection method, yielding what is called adaptive feature selection, and classification accuracy generally improved as a result. With a feature set of appropriate dimension, experiment five established, for each attribute, a hidden parent node that describes the dependency between that attribute and the other attributes, in order to relax the strict Naive Bayes assumption that attributes are mutually independent; this model is called single hidden Naive Bayes. The experimental results showed that classification accuracy improved. To improve the reliability of the reported accuracy, all experiments were performed with ten-fold cross-validation.
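The floating-point underflow described in experiment one, and the idea of comparing the normal-mail probability against the spam probability, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the toy messages, the function names `train_bernoulli_nb` and `log_odds`, the Laplace smoothing, and the use of log space (summing log-probabilities instead of multiplying probabilities) are all assumptions made for illustration. The abstract states only that a ratio was computed.

```python
import math

def train_bernoulli_nb(docs, labels, vocab):
    """Estimate log priors and P(word present | class) with Laplace smoothing."""
    model = {}
    for c in sorted(set(labels)):
        class_docs = [set(d) for d, y in zip(docs, labels) if y == c]
        n = len(class_docs)
        # Smoothing keeps every probability strictly between 0 and 1.
        p = {w: (sum(w in d for d in class_docs) + 1) / (n + 2) for w in vocab}
        model[c] = (math.log(n / len(docs)), p)
    return model

def log_odds(model, doc, vocab, pos="ham", neg="spam"):
    """Return log P(pos | doc) - log P(neg | doc), computed in log space."""
    present = set(doc)
    score = model[pos][0] - model[neg][0]
    for w in vocab:
        p_pos, p_neg = model[pos][1][w], model[neg][1][w]
        if w in present:
            score += math.log(p_pos) - math.log(p_neg)
        else:
            # Bernoulli model: absent words also contribute evidence.
            score += math.log(1 - p_pos) - math.log(1 - p_neg)
    return score

# Underflow: multiplying many small probabilities collapses to exactly 0.0,
# while the equivalent sum of logarithms stays finite.
naive_product = math.prod([0.5] * 5000)
log_sum = sum(math.log(0.5) for _ in range(5000))

# Hypothetical toy corpus (not the SpamAssassin data used in the paper).
docs = [
    ["free", "money", "now"],
    ["win", "free", "prize"],
    ["meeting", "tomorrow", "agenda"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]
vocab = sorted({w for d in docs for w in d})
model = train_bernoulli_nb(docs, labels, vocab)

spam_like = log_odds(model, ["free", "prize"], vocab)
ham_like = log_odds(model, ["meeting", "agenda"], vocab)
print(naive_product, log_sum, spam_like, ham_like)
```

A positive log-odds value favors normal mail and a negative value favors spam. Working with the logarithm of the ratio avoids the zero-or-infinity outcome that a raw ratio can still produce for very large vocabularies, which is one reason the feature dimension matters less for numerical stability in this formulation than in a direct product.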