The Research Of Image Spam Detecting Based On Similarity Assessment

Posted on: 2013-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Wang
GTID: 2218330371957533
Subject: Computer technology
Abstract/Summary:
With the rapid growth of the Internet, email has become one of the most common media for distributing information, and it brings great convenience to daily work and life. As reliance on email increases, unsolicited bulk mail (spam) continues to grow as well, and the flood of spam has caused great economic loss. Although current anti-spam technologies are quite successful at filtering text-based spam, these techniques are losing their potency as spammers become more agile. A new trend in email spam is the emergence of image spam, which is substantially more difficult to detect because spammers employ a variety of image creation and randomization algorithms. Image spam accounts for 40% of all global spam. Therefore, developing efficient image spam detection technologies is a very promising research area.

Firstly, the thesis gives the definition and characteristics of image spam, in which the spam message is embedded in images attached to the email. Current image spam detection techniques fall into three categories: blacklist and whitelist technology, behavior-based detection, and content-based detection, of which content-based detection is the focus of current research. The features of images are then analyzed and similarity metrics are introduced.

Secondly, a novel method that uses the Earth Mover's Distance (EMD) to measure email image similarity is proposed. Local invariant features are extracted to represent the image signatures, and a weighted EMD threshold is then trained to classify an email image as spam or ham. The experiments show that the proposed method is effective, although the performance of the classifier needs to be improved in the future.

Finally, a combining classifier framework is proposed, which combines a textual content classifier based on OCR with a visual content classifier based on EMD. Three fusion algorithms, AND, OR, and a decision tree rule, are given as the combining rules. The experiments show that the combining classifiers greatly improve classification performance and that the decision tree rule, in particular, is the most effective fusion algorithm.
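
As a rough illustration of the ideas summarized above, the following Python sketch thresholds an EMD score to obtain a visual verdict and then fuses it with a textual verdict using the AND and OR rules. It is a simplified sketch under stated assumptions, not the method of the thesis: the thesis computes EMD over signatures of local invariant features and trains a weighted threshold, whereas this example uses one-dimensional histograms with unit bin spacing and a fixed threshold. The names `emd_1d`, `is_visual_spam`, and `fuse` are hypothetical, and the decision tree rule is omitted because the abstract does not specify it.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two 1-D histograms with unit spacing between bins.

    Assumption: both histograms are normalised to total mass 1, in which
    case the EMD equals the summed absolute difference of their
    cumulative distributions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    cdf_p = accumulate(x / total_p for x in p)
    cdf_q = accumulate(x / total_q for x in q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def is_visual_spam(hist, spam_refs, threshold):
    """Flag an image as spam if it is close enough to any known spam image.

    Assumption: `threshold` is a fixed value chosen by hand here; the
    thesis trains a weighted threshold instead.
    """
    return min(emd_1d(hist, ref) for ref in spam_refs) <= threshold


def fuse(text_spam, visual_spam, rule):
    """Combine the textual and visual verdicts with the AND or OR rule."""
    if rule == "AND":
        return text_spam and visual_spam
    if rule == "OR":
        return text_spam or visual_spam
    raise ValueError("rule must be 'AND' or 'OR'")


if __name__ == "__main__":
    known_spam = [[0, 1, 0]]           # toy reference histogram
    candidate = [1, 0, 0]              # toy histogram of an incoming image
    visual = is_visual_spam(candidate, known_spam, threshold=1.0)
    text = False                       # pretend the OCR-based classifier said "ham"
    print(fuse(text, visual, "AND"), fuse(text, visual, "OR"))
```

The AND rule flags an email only when both classifiers agree, which tends to reduce false positives, while the OR rule flags it when either classifier does, which tends to reduce missed spam.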
Keywords/Search Tags: Image Spam, Scale Invariant Feature Transform, Earth Mover's Distance, similarity assessment, combining classifiers