
Research On Chinese Spam SMS Filtering Method Based On Rough Set And Naive Bayes

Posted on: 2013-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: T F Cao
GTID: 2298330467453083
Subject: Computer application technology
Abstract/Summary:
Spam messages cause a range of problems and hazards, and many researchers have studied how to filter them. The main filtering methods are the blacklist and whitelist methods, keyword-based methods, and methods based on the message text. The first two are too simple and inflexible, whereas methods based on the message text are more effective. Building on previous studies, this thesis proposes a feature weighting method and designs a Chinese spam message filtering system that uses Rough Set and Naive Bayes.

The main work of this thesis is as follows:

1. Several feature weighting methods are compared, and a new feature weighting method is proposed that combines the traditional TFIDF with Minimum Classification Error training in order to preserve classification accuracy. Experimental results demonstrate the feasibility of the method.

2. Two-stage filtering using Rough Set and Naive Bayes. In the first stage, basic attributes and a decision attribute are extracted from the message header, the message content, and so on, and Rough Set is used to train decision rules. When a test message arrives, the attributes that appear in the decision rules are extracted. If the test message matches a decision rule, it is assigned to the corresponding class; otherwise it is passed to the second stage. In the second stage, after word segmentation and stop-word removal, the message is represented in the vector space model. The value of each dimension is calculated by a weight formula that combines term frequency, feature entropy, and a per-term parameter trained by Minimum Classification Error. During feature selection, the terms whose weights exceed a fixed threshold are kept. Finally, Naive Bayes classifies the message according to the selected terms.

3. A message corpus is constructed in XML. The characteristics and text of each message are stored as one node; XML is well suited to building a simple database of this kind.

4. Finally, a simulation system for message classification is constructed and shown to be feasible.
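
The two-stage flow described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The rule set, the attribute names (`sender_in_contacts`, `has_url`), and the class and function names are hypothetical, and the rules are written by hand rather than learned with Rough Set. The abstract does not give the weight formula, so the sketch assumes feature selection has already produced a vocabulary of selected terms and that messages arrive already segmented into words. The second stage is a standard multinomial Naive Bayes with add-one smoothing, which may differ from the variant used in the thesis.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stage-1 decision rules: (attribute conditions, class label).
# In the thesis these would be learned with Rough Set; here they are hand-written.
RULES = [
    ({"sender_in_contacts": True}, "ham"),
    ({"has_url": True, "sender_in_contacts": False}, "spam"),
]


def match_rule(attrs, rules=RULES):
    """Return the label of the first rule whose conditions all hold, else None."""
    for conditions, label in rules:
        if all(attrs.get(key) == value for key, value in conditions.items()):
            return label
    return None


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over a fixed vocabulary."""

    def __init__(self, vocabulary):
        # Assumed to be the terms that survived feature selection.
        self.vocab = set(vocabulary)
        self.term_counts = defaultdict(Counter)  # label -> term -> count
        self.doc_counts = Counter()              # label -> number of messages

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.term_counts[label].update(t for t in tokens if t in self.vocab)
        return self

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each selected term.
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def classify(attrs, tokens, model, rules=RULES):
    """Stage 1: try the decision rules. Stage 2: fall back to Naive Bayes."""
    label = match_rule(attrs, rules)
    return label if label is not None else model.predict(tokens)
```

A message is decided by the rules alone when its attributes match one of them; only unmatched messages incur the cost of the text classifier, which is the point of the cascade.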
Keywords/Search Tags: Rough Set, Naive Bayes, spam message filtering, feature weighting