
Design And Implementation Of Web Information Collection System

Posted on: 2014-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Zhou
Full Text: PDF
GTID: 2248330398974673
Subject: Computer software and theory
Abstract/Summary:
With the rapid development and spread of mobile terminals, people are increasingly accustomed to obtaining information of interest through reading applications installed on those terminals. Platform vendors (including content providers) must therefore build a technology platform to support this business. The platform's content can be obtained in two ways: manual editing, or automatic collection from information sources by a program. This thesis presents a Web information collection solution for the latter.

The thesis first introduces the research background, the state of research, and the relevant information extraction technology, including how information collection works and how webpage structure is analyzed. Second, it analyzes the system's functions and users in detail, models the use cases with use case diagrams and use case specifications, and analyzes the system's non-functional requirements. It then presents the design of the system and its database, followed by the detailed design and implementation. Finally, it verifies the effectiveness of the solution by testing the system. The key work is as follows:

1. The thesis analyzes how to locate target information in an HTML document and designs information extraction rules, configured through a simple visual interface with human-computer interaction, that are based on HTML tags and attributes and DOM path expressions. On this basis, it gives a solution for de-noising the main body of a page.
2. The system comprises a collection configuration subsystem and a collection subsystem. The former passes the configured collection tasks to the latter through a socket mechanism in order to start and stop them. The benefit is that users get the collection results without having to concern themselves with how the collection runs.
3. The collection subsystem regularly and automatically collects, extracts, de-noises, and de-duplicates information from the user-configured sites, using multi-threading, a database connection pool, a dynamic collection strategy, and multi-page consolidation. Site-specific information is updated by collecting at regular intervals.
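As an illustration of the tag-and-attribute extraction rules mentioned in the first item, the following is a minimal sketch and not the thesis's implementation. The keywords indicate that the thesis uses Jsoup (Java); here Python's standard `html.parser` module stands in for it so the example needs no dependencies. The names `RuleExtractor` and `extract`, the rule shape `(tag, attr, value)`, and the sample page are all hypothetical. The sketch assumes reasonably well-formed HTML and an exact attribute match, and it treats `<script>` and `<style>` content as noise.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
NOISE_TAGS = {"script", "style"}  # assumed noise; a real rule set would be richer


class RuleExtractor(HTMLParser):
    """Collects text inside the first element matching a (tag, attr, value) rule."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.rule = (tag, attr, value)
        self.depth = 0        # nesting depth inside the matched element (0 = outside)
        self.noise_depth = 0  # greater than 0 while inside <script> or <style>
        self.done = False     # True once the matched element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
            if tag in NOISE_TAGS:
                self.noise_depth += 1
        else:
            rule_tag, rule_attr, rule_value = self.rule
            if tag == rule_tag and dict(attrs).get(rule_attr) == rule_value:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        if tag in NOISE_TAGS and self.noise_depth:
            self.noise_depth -= 1
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        # Keep only non-empty text that is inside the target and outside noise tags.
        if self.depth and not self.noise_depth and data.strip():
            self.chunks.append(data.strip())


def extract(html, tag, attr, value):
    """Apply one extraction rule to an HTML string and return the joined text."""
    parser = RuleExtractor(tag, attr, value)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = """
<html><body>
  <div id="nav">Home | News</div>
  <div class="article">
    <h1>Title</h1>
    <script>track();</script>
    <p>First paragraph.<br>Second line.</p>
  </div>
  <div id="footer">Copyright</div>
</body></html>
"""

print(extract(page, "div", "class", "article"))
```

In this sketch the rule `("div", "class", "article")` selects the article body and skips the navigation bar, the footer, and the script. A DOM path expression, as described in the abstract, would generalize this single-step rule to a chain of such steps.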
Keywords/Search Tags: Web information collection, Web information extraction, DOM, Jsoup, multiple threads