| With the development of Internet technology, e-mail has gradually replaced traditional postal communication and become an indispensable part of people’s daily lives. However, some people spread malicious information by e-mail for profit. Some of these messages are ordinary commercial advertisements, while others carry reactionary or fraudulent content. The spread of such e-mail is not only detrimental to people’s daily lives but also a threat to social security and stability. This paper focuses on content-based spam filtering technology. Based on an analysis of common spam filtering techniques, it presents several methods that improve on traditional techniques in the spam detection process. The research is divided into four parts. (1) Review of content-based spam filtering technology. This paper reviews content-based spam filtering from four aspects: feature representation, dimensionality reduction, classification methods, and evaluation criteria. The methods and tools used in each aspect are surveyed and organized. (2) Chinese word segmentation based on an interval sliding window. Text segmentation is an important part of spam filtering research. To evade spam filters, spammers degrade word segmentation by inserting abnormal characters into the original message text and hiding sensitive words in the text. To address this, this paper proposes a Chinese word segmentation method based on an interval sliding window, which combines the interval sliding window with a word segmentation method. It first uses the interval sliding window to filter abnormal characters out of the text, then matches the resulting strings against the word segmentation dictionary and extracts the valid entries. The method also increases the amount of text information collected. (3) Mutual information feature selection based on feature contribution ratio. With the
increasing amount of data, feature reduction has become an indispensable part of text classification. Common feature selection methods are usually designed for multi-class problems and do not handle binary classification effectively. This paper therefore improves the traditional mutual information feature selection method: a word frequency factor is introduced to address the lack of word frequency information, and the concept of a feature contribution ratio is introduced for the binary classification problem. Experiments show that the feature set obtained by mutual information feature selection based on feature contribution ratio greatly improves spam detection performance. (4) ROC-SVM algorithm based on L1-norm regularization. In recent years, the class imbalance problem has become a research hotspot. Because the numbers of samples collected from real life often differ widely across categories, classification suffers; in particular, recognition of minority-class samples is very poor. Building on the ROC-SVM algorithm, this paper introduces L1-norm regularization. While optimizing AUC, an indicator that is insensitive to class imbalance, L1-norm regularization is used to reduce the adverse effect that the sparsity of the text space model has on the classifier. It also greatly reduces the time required for testing. |
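
The idea in part (4) can be illustrated with a minimal sketch. It is not the paper's algorithm, only a common way to write an AUC-oriented linear SVM objective: a pairwise hinge loss over every (spam, non-spam) pair, plus an L1 penalty that pushes weights toward zero. The function names, the toy vectors, and the penalty weight `lam` are all assumptions made for illustration.

```python
from itertools import product


def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pairs = list(product(scores_pos, scores_neg))
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)


def dot(w, x):
    """Inner product of a weight vector and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def l1_pairwise_hinge(w, pos, neg, lam):
    """Mean pairwise hinge loss (a surrogate for 1 - AUC) plus lam * ||w||_1.

    lam is a hypothetical regularization weight, not a value from the paper.
    """
    pair_losses = [
        max(0.0, 1.0 - (dot(w, p) - dot(w, n))) for p, n in product(pos, neg)
    ]
    l1_penalty = lam * sum(abs(wi) for wi in w)
    return sum(pair_losses) / len(pair_losses) + l1_penalty


# Toy two-feature data (hypothetical): positives are spam, negatives are not.
pos = [[1.0, 0.0], [2.0, 1.0]]
neg = [[0.0, 1.0], [0.0, 0.0], [1.0, 2.0]]
w = [1.0, -1.0]

auc_value = auc([dot(w, x) for x in pos], [dot(w, x) for x in neg])
objective = l1_pairwise_hinge(w, pos, neg, lam=0.1)
```

Because the loss is averaged over positive-negative pairs rather than over individual samples, the size of the majority class does not dominate the objective, which is the sense in which AUC is insensitive to imbalance. The L1 term favors sparse weight vectors, which is one plausible reason for the shorter test time mentioned above.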