Research Of Pretreatment About The Noise Information On The Web Page Text

Posted on: 2014-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: X M Li
GTID: 2248330398986241
Subject: Computer technology
Abstract/Summary:
This thesis studies methods for the pretreatment (preprocessing) of noise information in web pages, based on their text content. Negative information, including anti-government, pornographic, and violent content, not only hinders the building of a healthy network environment but also endangers social stability and people's physical and mental health. With the development of the Internet and stronger monitoring by the relevant national departments, a large amount of noise information has begun to appear on the network, and negative information has changed considerably. It has become so well hidden that the relevant technologies find it harder to identify. Traditional methods have become ineffective against it, and classic algorithms have lost effectiveness and accuracy. Preprocessing the noise information is therefore important, because it makes obscured negative information easy for classic algorithms to identify.

The thesis starts from the sensitive word "Falun Gong" and its various interfering forms, and analyzes the types of interference. We propose several approaches: character encoding conversion, removal of HTML tags, removal of special symbols, and recovery of sensitive words. We build an interference vocabulary for the sensitive word that collects and lists as many noise words as possible across its various interfering forms. We give priority to the high-frequency interfering forms and then extend the collection to the low-frequency ones. After that, newly discovered forms are added so that the database is completed gradually. The sensitive word is then mapped to its noise forms so that the noise information can be removed.

During preprocessing, we first convert the web pages fetched by web crawlers from their different encodings to GB2312. This reduces the number of noise words with different encodings in the thesaurus. At the same time, we convert traditional Chinese characters to simplified Chinese and ensure that the web text to be processed uses the same encoding as the training text samples. Next, we use regular expressions to remove HTML tags and special symbols from the text lines. Because the noise-word vocabulary is large, we adopt the WM multi-pattern matching algorithm to improve processing efficiency. The WM algorithm keeps the processing time from varying dramatically as the vocabulary grows, which also saves time in the subsequent procedures.

Finally, we implemented a preprocessing system with the above functions and tested its feasibility and effectiveness. The results show that the system can be used for text classification and sensitive-word preprocessing, and as part of a negative-text filtering system to improve that system's accuracy and effectiveness.
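
The steps described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system built in the thesis. The vocabulary entries, the set of special symbols, the neutral placeholder word "keyword", and all function names are hypothetical. The matcher is a simplified block-shift scheme in the spirit of WM (Wu-Manber) multi-pattern matching, and traditional-to-simplified conversion is omitted because it needs a character mapping table.

```python
import re

# Hypothetical interference vocabulary: noise form -> canonical word.
# "keyword" is a neutral placeholder; the entries are invented for illustration.
NOISE_VOCABULARY = {
    "keyw0rd": "keyword",
    "k e y w o r d": "keyword",
    "kwd": "keyword",
}

TAG_RE = re.compile(r"<[^>]+>")        # simplified: anything between < and >
SYMBOL_RE = re.compile(r"[*#@~|^]+")   # assumed set of interfering symbols
BLOCK = 2                              # block size used by the shift table


def to_gb2312(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode a crawled page as GB2312, dropping unrepresentable characters."""
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("gb2312", errors="ignore")


def build_tables(patterns):
    """Build shift and bucket tables over the first m characters of each pattern."""
    m = min(len(p) for p in patterns)  # length of the shortest pattern
    if m < BLOCK:
        raise ValueError("every pattern must have at least BLOCK characters")
    default_shift = m - BLOCK + 1      # safe jump when a block occurs in no pattern
    shift, buckets = {}, {}
    for p in patterns:
        for j in range(m - BLOCK + 1):
            block = p[j:j + BLOCK]
            shift[block] = min(shift.get(block, default_shift), m - BLOCK - j)
        buckets.setdefault(p[m - BLOCK:m], []).append(p)
    return m, default_shift, shift, buckets


def find_noise_forms(text, tables):
    """Return (start, pattern) pairs for every pattern occurrence in text."""
    m, default_shift, shift, buckets = tables
    hits, pos = [], m - 1              # pos indexes the last character of the window
    while pos < len(text):
        block = text[pos - BLOCK + 1:pos + 1]
        step = shift.get(block, default_shift)
        if step == 0:                  # block ends some pattern's m-prefix: verify
            start = pos - m + 1
            hits += [(start, p) for p in buckets.get(block, [])
                     if text.startswith(p, start)]
            step = 1
        pos += step
    return hits


def recover(text, vocabulary, tables):
    """Replace each noise form with its canonical word, longest match first."""
    out, cursor = [], 0
    hits = sorted(find_noise_forms(text, tables), key=lambda h: (h[0], -len(h[1])))
    for start, form in hits:
        if start < cursor:             # overlaps a form that was already replaced
            continue
        out.append(text[cursor:start])
        out.append(vocabulary[form])
        cursor = start + len(form)
    out.append(text[cursor:])
    return "".join(out)


def clean_text(html, vocabulary=NOISE_VOCABULARY):
    """Strip tags, strip special symbols, then recover the canonical word."""
    text = TAG_RE.sub("", html)
    text = SYMBOL_RE.sub("", text)
    return recover(text, vocabulary, build_tables(list(vocabulary)))


print(clean_text("<p>A key*word, a keyw0rd and a kwd.</p>"))
```

In this sketch, symbol removal alone repairs "key*word", while the vocabulary lookup handles forms such as "keyw0rd" that no symbol filter would catch. The shift table lets the scan skip ahead whenever the current two-character block appears in no pattern prefix, which is the property that keeps matching time from growing sharply with the vocabulary.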
Keywords: Internet, Noise information, Sensitive words, Pretreatment