Research On E-mail Sensitive Words Detection And Alarm Based On Hadoop

Posted on: 2016-05-17
Degree: Master
Type: Thesis
Country: China
Candidate: J Sun
Full Text: PDF
GTID: 2298330452466416
Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the big data era, data has begun to grow explosively, and the growing population of Internet users is increasingly drowning in an ocean of data. How to quickly detect whether an E-mail contains sensitive information among a very large number of messages has therefore become a serious problem. As an important technology for maintaining social stability, E-mail sensitive word detection and alarm first requires a sensitive-word thesaurus. The E-mail content is then matched against this thesaurus to identify illegal E-mails that contain sensitive information, and finally an alarm is raised. However, traditional E-mail sensitive word detection and alarm technology has several shortcomings: it ignores attachments, it does not scale to very large amounts of data, its alarm rules are simple, and it catches only a low proportion of illegal E-mails. Based on an actual information security audit system project, this paper starts from the background, significance, and research status of custom sensitive word detection and alarm techniques and studies them in more depth. On this basis, it proposes a matching algorithm based on Chinese word segmentation and alarm rules based on a decision tree, which can ease these challenges to some extent. Finally, using MapReduce, Hive, HBase, R, and other tools, the paper implements these algorithms on the Hadoop platform and builds a detection and alarm system.
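
The segment-then-match idea described above can be illustrated with a minimal sketch. The abstract does not say which segmentation method the thesis adopts, so forward maximum matching is used here purely as an assumption; the tiny dictionary `VOCAB`, the list `SENSITIVE_WORDS`, and both function names are hypothetical and exist only for illustration.

```python
# Hypothetical segmentation dictionary and sensitive-word list (illustration only).
VOCAB = {"请", "查收", "机密", "文件", "附件"}
SENSITIVE_WORDS = {"机密"}


def forward_max_match(text, vocabulary, max_len=4):
    """Segment text greedily, preferring the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def find_sensitive_words(text, vocabulary, sensitive_words):
    """Return the sensitive words that occur as whole tokens in the text."""
    tokens = forward_max_match(text, vocabulary)
    return sorted(set(tokens) & set(sensitive_words))


if __name__ == "__main__":
    print(forward_max_match("请查收机密文件", VOCAB))
    print(find_sensitive_words("请查收机密文件", VOCAB, SENSITIVE_WORDS))
```

Once the text is reduced to tokens, checking each token against a set of sensitive words is a constant-time lookup on average, which is one plausible reading of how segmentation eases the matching cost on large attachments.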
To sum up, the main work of this paper is reflected in the following aspects:
1) For the huge volume of E-mail attachment data, this paper surveys the popular Chinese word segmentation methods, splits the attachment content into words, and matches these words against the sensitive words, which reduces the complexity that the large data volume imposes on the sensitive word matching algorithm.
2) Because conventional alarm methods are simple and catch only a low proportion of illegal E-mails, this paper applies a current mainstream decision tree algorithm and develops a checking scheme that combines a whitelist, a blacklist, and manual inspection, so as to derive alarm rules on a more scientific basis.
3) To address large-scale data processing and algorithm scalability, the sensitive word detection algorithm is deployed on a Hadoop cluster and runs in parallel, which improves the system's scalability to some extent. The E-mail content is stored in HBase and the detection results are stored in Hive, which effectively addresses the problem of storing and analyzing big data.
4) Using MapReduce, HDFS, HBase, Hive, R, and other tools, this paper designs and implements an E-mail sensitive word detection and alarm system as a foundation for future research.
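
As a rough illustration of points 2) and 3), the sketch below simulates a map step and a reduce step in plain Python and then applies a whitelist, blacklist, and manual-review triage. It is not the thesis implementation: the fixed threshold stands in for rules that the thesis derives with a decision tree, applying the lists to sender addresses is an assumption, and every name, address, and value (`WHITELIST`, `BLACKLIST`, `SENSITIVE`, `alarm_threshold`, `EMAILS`) is hypothetical.

```python
from collections import defaultdict

WHITELIST = {"hr@example.com"}            # hypothetical trusted senders
BLACKLIST = {"leak@example.com"}          # hypothetical known offenders
SENSITIVE = {"confidential", "password"}  # hypothetical sensitive words

# Hypothetical, already-segmented e-mails.
EMAILS = [
    {"id": "m1", "sender": "hr@example.com", "tokens": ["password", "reset"]},
    {"id": "m2", "sender": "bob@example.com",
     "tokens": ["confidential", "password", "confidential"]},
    {"id": "m3", "sender": "amy@example.com", "tokens": ["confidential", "lunch"]},
    {"id": "m4", "sender": "leak@example.com", "tokens": ["hello"]},
]


def map_email(email):
    """Map step: emit (email_id, 1) for every sensitive token in one e-mail."""
    for token in email["tokens"]:
        if token in SENSITIVE:
            yield email["id"], 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted hits per e-mail id."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def triage(sender, hits, alarm_threshold=3):
    """Hand-written stand-in for learned alarm rules (threshold is an assumption)."""
    if sender in WHITELIST:
        return "pass"
    if sender in BLACKLIST:
        return "alarm"
    if hits >= alarm_threshold:
        return "alarm"
    if hits > 0:
        return "manual_review"
    return "pass"


def run(emails):
    """Count sensitive hits per e-mail, then decide what to do with each one."""
    hits = reduce_counts(pair for email in emails for pair in map_email(email))
    return {e["id"]: triage(e["sender"], hits.get(e["id"], 0)) for e in emails}


if __name__ == "__main__":
    print(run(EMAILS))
```

On a real cluster the map and reduce functions would run as distributed tasks over data read from HBase or HDFS; here they are ordinary Python functions so that the control flow is easy to follow.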
Keywords/Search Tags: sensitive words detection, alarm, Chinese words split, Hadoop