
A Novel Approach For Filtering Chinese Image SPAM

Posted on: 2016-07-15
Degree: Master
Type: Thesis
Country: China
Candidate: B Xu
Full Text: PDF
GTID: 2298330467491898
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Following the success of traditional text spam filters, spammers have started spreading spam through images. This new type of spam is more difficult to detect and wastes more Internet bandwidth and storage; it therefore poses a greater threat, especially in the case of Chinese image spam. By analyzing traditional image spam filtering methods and the need to extract sufficient content information, we propose a novel pseudo-OCR technique for Chinese spam filtering that improves on classical optical character recognition (OCR) based methods. To balance the need for content information extraction against detection performance, we argue that most of the recognition performed by traditional OCR is redundant. We therefore loosen the recognition requirement, optimize the preprocessing procedure specifically for spam images, and adopt a more data-oriented sample library. Experimental results show that the detection performance on image spam can be tuned easily by adjusting a few parameters. Compared with the traditional OCR-based method, the proposed pseudo-OCR achieves much higher performance, especially when a low false positive rate is required.

As the core functionality of pseudo-OCR, this paper defines a novel key-point based statistical Chinese character feature, which is extracted by a carefully designed depth-first search (DFS) based method that jointly considers neighborhood information and Chinese character shape. The results show that it outperforms traditional corner detection methods for Chinese character key-point extraction.

Moreover, to improve the relatively low recall rate of pseudo-OCR, this paper adopts two low-level image feature based methods as supplementary filtering techniques, thereby constructing a complete system for Chinese image spam filtering. The combined system sacrifices a little precision and false positive rate to obtain a much better recall rate, and therefore achieves better overall performance.
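The trade-off described in the abstract (giving up a little precision and false positive rate in exchange for recall) can be illustrated with a small sketch. This is not the thesis implementation: the abstract does not say how the supplementary filters are combined with pseudo-OCR, so the OR-combination below is an assumption, the function names are hypothetical, and the predictions are made-up toy data rather than reported results.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    precision: float
    recall: float
    false_positive_rate: float


def evaluate(predictions, labels):
    """Compute precision, recall and false positive rate (True = spam)."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)
    fp = sum(1 for p, y in pairs if p and not y)
    fn = sum(1 for p, y in pairs if not p and y)
    tn = sum(1 for p, y in pairs if not p and not y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return Metrics(precision, recall, fpr)


def combine_or(primary, supplementary):
    """Assumed combination rule: flag an image if either filter flags it."""
    return [a or b for a, b in zip(primary, supplementary)]


# Toy data (invented for illustration): 5 spam images, then 5 legitimate ones.
labels = [True] * 5 + [False] * 5
# A strict content-based filter: no false alarms, but it misses most spam.
primary = [True, True, False, False, False] + [False] * 5
# A looser supplementary filter: catches more spam, with one false alarm.
supplementary = [False, True, True, True, False] + [True, False, False, False, False]

alone = evaluate(primary, labels)
combined = evaluate(combine_or(primary, supplementary), labels)
print("primary only:", alone)
print("combined:    ", combined)
```

On this toy data the combined filter flags more of the spam images than the strict filter alone, at the cost of one false alarm, which mirrors the direction of the trade-off the abstract reports.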
Keywords/Search Tags: image spam, optical character recognition, pseudo-OCR, low-level image feature