
Research On Spam Filter Model Based On Support Vector Machine

Posted on: 2009-11-13
Degree: Master
Type: Thesis
Country: China
Candidate: J W Gao
GTID: 2178360245486583
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, e-mail has become a primary means of modern communication. At the same time, spam has become pervasive online and causes considerable trouble for many users. Preventing and controlling spam effectively is therefore both important and practical.

The thesis surveys a large body of anti-spam literature and data from China and abroad, and analyzes and summarizes existing anti-spam techniques. E-mail filtering is an important measure against spam; current filters are mainly based on IP addresses, on rules, or on message content.

The focus of this thesis is content-based e-mail filtering, that is, filtering e-mail by analyzing its contents. This is essentially a text categorization problem: the text of a message is preprocessed and the message is then classified as spam or not spam. The thesis studies text categorization techniques in depth, concentrating on the theory of the Support Vector Machine (SVM), its use in text categorization, and its application to spam filtering.

The thesis applies text categorization mainly to filtering HTML-type spam. For HTML spam, it studies preprocessing methods, text parsing techniques, noise removal, Chinese and English word segmentation, and feature selection based on a similarity curve.

An SVM-based anti-spam processing system was designed and implemented. The system uses the Forward Maximum Matching method together with the Lucene and GATE tools for Chinese and English word segmentation, adopts the similarity curve for feature selection and extraction, applies a weighting formula that takes the position of words into account, and uses the SVM algorithm and the LIBSVM tool to classify e-mail by content.
The experimental results show that applying the SVM algorithm to spam processing is an effective way to achieve feature-based filtering of spam.
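The Forward Maximum Matching method named in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `forward_maximum_matching`, the `max_word_len` parameter, and the tiny example dictionary are assumptions made for illustration only. The idea is to scan the text from left to right and, at each position, take the longest dictionary word that starts there, falling back to a single character when nothing matches.

```python
def forward_maximum_matching(text, dictionary, max_word_len=None):
    """Segment `text` greedily from left to right.

    At each position the longest substring found in `dictionary` is taken
    as the next token; if no dictionary word matches, a single character
    is emitted. All names here are illustrative, not from the thesis.
    """
    if max_word_len is None:
        # Longest dictionary entry bounds the candidate window.
        max_word_len = max((len(word) for word in dictionary), default=1)

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, for demonstration only.
example_dictionary = {"研究", "研究生", "生命", "的", "起源"}
example_tokens = forward_maximum_matching("研究生命的起源", example_dictionary)
```

In this toy example the greedy strategy picks "研究生" before "研究", which shows the method's known weakness on ambiguous boundaries. In a classifier pipeline such tokens would typically feed the feature selection step.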
Keywords/Search Tags: spam filter, support vector machine, feature selection and extraction