
Research and Implementation of Web Information Integration Technology

Posted on: 2011-05-21  Degree: Master  Type: Thesis
Country: China  Candidate: L J Jiang  Full Text: PDF
GTID: 2178360308463851  Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, web content has become increasingly dynamic. Many web pages are now generated in response to users' queries; these dynamically generated pages are known as the Deep Web. To extract useful information from the Deep Web and integrate it, a simple and effective technique is needed.
Building on the workflow mechanism of the Information Exchange Platform (IEP) developed in the author's lab, this paper focuses on information extraction and information integration, as follows:
1) Web information extraction based on web page schema. Current web pages are diverse in structure and largely unstructured. We compare and analyze pages from similar web sites, parse the pages to remove redundant information, and locate the theme module. We then automatically identify the schema, generate extraction rules, and extract information according to those rules. In this way, unstructured content is turned into the structured data that users actually need.
2) Web integration by simulating user behavior. By recording the user's operations on a web page, the algorithm can simulate that behavior and interact with the web server in place of manual input. It submits requests automatically in order to retrieve the dynamic information hidden behind the server. To integrate web information, the user first defines the specific information exchange process. Based on the process file configured by the user, the workflow execution engine schedules each module to carry out the information integration and exchange.
This paper designs and implements a web information exchange platform based on the research described above. Experiments with the system show that the schema recognition algorithm can automatically identify the schemas of different web pages and extract data based on those schemas. Relying on the workflow engine, process configuration and process execution achieve web system integration and exchange. The system supports heterogeneous data sources and a variety of data terminal types. It also performs well and has good prospects for commercial application, and is especially suitable for SMEs.
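As a rough illustration of the page-comparison idea in item 1), the sketch below drops text that recurs on a similar page from the same site (navigation, labels, footer) and keeps what is left as the candidate theme content, grouped by tag path. This is not the schema recognition algorithm of the thesis, which the abstract does not specify; the class TextPathCollector, the functions collect and extract_theme, and the two sample pages are all hypothetical and only assume Python's standard html.parser module.

```python
from html.parser import HTMLParser


class TextPathCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    SKIPPED = {"script", "style"}                      # content never shown to users
    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.items = []   # collected (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (set(self.stack) & self.SKIPPED):
            self.items.append(("/".join(self.stack), text))


def collect(html):
    """Parse one page and return its (tag path, text) pairs."""
    parser = TextPathCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def extract_theme(page, sibling_pages):
    """Return {tag path: [texts]} for text that does not recur on sibling pages."""
    template = set()
    for other in sibling_pages:
        template.update(collect(other))
    record = {}
    for path, text in collect(page):
        if (path, text) not in template:   # shared text is treated as redundant
            record.setdefault(path, []).append(text)
    return record


# Two hypothetical pages generated from the same site template.
PAGE_A = """<html><body>
<div><a>Home</a><a>Contact</a></div>
<table><tr><td>Name</td><td>Laptop X1</td></tr>
<tr><td>Price</td><td>4999</td></tr></table>
<p>Copyright</p></body></html>"""

PAGE_B = """<html><body>
<div><a>Home</a><a>Contact</a></div>
<table><tr><td>Name</td><td>Phone Z2</td></tr>
<tr><td>Price</td><td>1299</td></tr></table>
<p>Copyright</p></body></html>"""

if __name__ == "__main__":
    print(extract_theme(PAGE_A, [PAGE_B]))
```

In this toy case only the two table cells that differ between the pages survive, and the tag path they share could serve as a simple extraction rule for further pages of the same kind. A real system would need a more tolerant comparison than exact text equality.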
Keywords/Search Tags: Web information integration, Deep Web, web schema identification, simulation of user behavior