The Research Of Spam Filtering Technology By Semantics-based Text Classification

Posted on: 2017-04-24  Degree: Master  Type: Thesis
Country: China  Candidate: W Hu
GTID: 2348330503465926  Subject: Computer software and theory
Abstract/Summary:
Spam, that is, unsolicited bulk email, has been part of everyday life for decades. Its growth has seriously reduced the efficiency of email, which is used not only for conversation but also as a task manager, document delivery system, and archive. Some studies report that spam can account for as much as 90% of all email. Spam topics range from illegal products and services to intimidation and fraud, and spam also leads to information theft by helping malware spread rapidly. To keep the situation from deteriorating, many solutions have been proposed, and spam filtering technologies have been extensively developed and commercialized over the years. Yet an email user still encounters hundreds of spam emails every year, which implies that spam filtering still needs improvement.
Because spam can be filtered by analyzing any stage of the email delivery process, many methods are in common use and typically work in conjunction, such as whitelists/blacklists, challenge-response, rule-based filtering, keyword-based filtering, and content-based filtering. This thesis focuses mainly on the text classification methods and natural language processing technologies commonly used in spam filtering. Spam filtering can be regarded as a special binary text classification task: deciding whether an email is spam or not. The most common way to represent a set of texts is to transform each text into a vector based on the words it contains, so that the vectors of all texts in the set form a vector space model. With this model, text classification can be carried out using clustering methods or machine learning algorithms.
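The word-based vector space model and a classifier on top of it can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the whitespace tokenizer, the four toy training emails, and all function names are assumptions made for the example, and a multinomial naive Bayes classifier with add-one smoothing stands in for the machine learning step.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization. Chinese email would first
    # need a word segmenter, which is outside this sketch.
    return text.lower().split()


def build_vocabulary(texts):
    # One dimension of the vector space per distinct word.
    words = sorted({tok for text in texts for tok in tokenize(text)})
    return {word: index for index, word in enumerate(words)}


def to_vector(text, vocab):
    # Term-count vector; words outside the vocabulary are ignored.
    vector = [0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vector[vocab[tok]] = count
    return vector


def train_naive_bayes(vectors, labels):
    # Multinomial naive Bayes with add-one (Laplace) smoothing.
    dim = len(vectors[0])
    model = {}
    for label in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        counts = [sum(column) for column in zip(*rows)]
        total = sum(counts)
        log_prior = math.log(len(rows) / len(vectors))
        log_likelihood = [math.log((n + 1) / (total + dim)) for n in counts]
        model[label] = (log_prior, log_likelihood)
    return model


def classify(vector, model):
    # Pick the class with the highest log posterior score.
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(x * w for x, w in zip(vector, log_likelihood))
    return max(model, key=score)


# Toy training data, invented for illustration only.
train_texts = [
    "cheap pills buy now",
    "win money now",
    "meeting agenda for monday",
    "project report attached",
]
train_labels = ["spam", "spam", "ham", "ham"]

vocab = build_vocabulary(train_texts)
model = train_naive_bayes([to_vector(t, vocab) for t in train_texts], train_labels)
```

Even in this toy setting every distinct word adds one dimension to the vector space, so the vocabulary grows with the size of the corpus.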
However, using words as features to build text vectors leads to a large amount of computation during classification because of the high dimensionality of the resulting vector space, especially when large volumes of text are involved, as is common in spam filtering. Although many feature selection methods have been proposed in the literature, the high dimensionality of the vector space remains the major difficulty in text classification. This is another aspect of spam filters that requires improvement.
Machine learning algorithms are widely applied in text classification and usually perform well; among them, Support Vector Machines and the naive Bayes classifier are the most popular spam filters presented in the literature.
In this thesis, we present a novel and efficient Chinese spam filter based on the semantic information in emails. We conducted an empirical experiment on a large, well-known public dataset, the TREC2006 c email corpus. Extracting semantic information from the text as features, instead of words, is an effective and highly efficient way to reduce the number of feature terms. In addition, this spam filter not only classifies email into two classes, spam and ham, but also performs multi-class classification of text based on semantic subjects, which is useful for building a personalized spam filter for each user. The experiments showed satisfactory performance on multi-class classification with fewer feature terms.
Moreover, the method of attaching semantics-based annotations has great potential for text classification in many areas and is less restricted by differences between languages. It can be used effectively to classify SMS messages, news, scientific literature, and messages delivered through social networks. Finally, the ease of implementing personalized spam filtering is significant for building user-friendly and approachable commercial spam filtering systems.
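The idea of replacing word features with semantic features can be illustrated with a small sketch. The abstract does not describe how the thesis extracts semantic information, so everything here is an assumption: the hand-made `SEMANTIC_LEXICON`, its three categories, and the rule of picking the most frequent category as the subject are hypothetical stand-ins that only show why the feature space shrinks and how a subject label could support multi-class classification.

```python
from collections import Counter

# Hypothetical lexicon mapping words to semantic categories.
# A real system would use a much larger semantic resource.
SEMANTIC_LEXICON = {
    "pills": "medicine", "pharmacy": "medicine", "viagra": "medicine",
    "loan": "finance", "money": "finance", "invoice": "finance", "bank": "finance",
    "meeting": "work", "agenda": "work", "report": "work",
}

# One feature per semantic category instead of one per word.
CATEGORIES = sorted(set(SEMANTIC_LEXICON.values()))


def semantic_vector(text):
    # Count how many tokens fall into each semantic category.
    counts = Counter(
        SEMANTIC_LEXICON[tok]
        for tok in text.lower().split()
        if tok in SEMANTIC_LEXICON
    )
    return [counts[category] for category in CATEGORIES]


def dominant_subject(text):
    # Simplistic multi-class rule (an assumption): the most frequent
    # category is the subject; None when no lexicon word appears.
    vector = semantic_vector(text)
    if not any(vector):
        return None
    return CATEGORIES[vector.index(max(vector))]
```

With this lexicon, ten distinct words collapse into three features, and a call such as `dominant_subject("cheap pills from online pharmacy")` yields a subject label that a per-user filter could act on.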
Keywords/Search Tags: spam filtering, text classification, semantics