Application Restructuring And Content Extraction Based On Internet

Posted on: 2015-10-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Zhang
GTID: 2298330467463420
Subject: Signal and Information Processing
Abstract/Summary:
With the development of computer technology, more and more network applications have appeared in daily life. They meet the demands of different users and, at the same time, produce a great deal of data. To maintain the normal order of the Internet, harmful information must be filtered out and useful information extracted.

This thesis is devoted to the research and implementation of application restructuring and content extraction. It consists of three parts. The first part covers the design and implementation of network application restructuring. In the second part, we use regular expressions to extract information from popular BBS community applications. In the third part, we propose a content extraction method that can extract specified information from different web pages.

First, the thesis introduces basic concepts of HTML and the DOM and presents several of the technologies involved, such as packet capture and hashing. Second, we design and implement the network application restructuring process: we reassemble TCP sessions with the open-source libnids library and reassemble the HTTP message data by decoding chunked data and decompressing compressed data. We then analyze a large number of communication packets from BBS systems and propose the concept of a BBS fingerprint, which can be used to identify different BBS systems and extract user information efficiently. Finally, we put forward a content extraction method based on the DOM tree, combined with the features of the web pages and the characteristics of the content to be extracted. Using this method, we implement a module that can keep track of application versions.
Keywords/Search Tags: TCP restructuring, HTTP data reassembling, BBS fingerprint, information extraction
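
The HTTP data reassembling step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation (which builds on libnids), only an illustration of the general technique under stated assumptions: the TCP stream has already been reassembled, the headers are available as a dictionary with lowercase keys, and the function names decode_chunked and reassemble_body are hypothetical. The chunked transfer coding is undone first, and the content coding (gzip or zlib-wrapped deflate) is undone afterwards.

```python
import gzip
import zlib


def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked transfer-coded body into raw bytes."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("truncated chunk header")
        # The chunk size is hexadecimal; anything after ';' is a chunk extension.
        size = int(body[pos:line_end].split(b";", 1)[0].strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; trailer fields, if any, are ignored here
        if pos + size > len(body):
            raise ValueError("truncated chunk data")
        out += body[pos:pos + size]
        pos += size + 2  # skip the CRLF that terminates the chunk data
    return bytes(out)


def reassemble_body(headers: dict, body: bytes) -> bytes:
    """Undo Transfer-Encoding first, then Content-Encoding.

    Assumption: `headers` maps lowercase header names to their values.
    """
    if "chunked" in headers.get("transfer-encoding", "").lower():
        body = decode_chunked(body)
    encoding = headers.get("content-encoding", "").lower()
    if encoding == "gzip":
        body = gzip.decompress(body)
    elif encoding == "deflate":
        body = zlib.decompress(body)  # assumes the zlib-wrapped form
    return body
```

The order matters: the chunk framing is applied to the already compressed bytes, so decompressing before removing the chunk headers would fail.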