
Design and Implementation of a Web Object Extraction and Retrieval System

Posted on: 2009-11-13
Degree: Master
Type: Thesis
Country: China
Candidate: G J Liu
Full Text: PDF
GTID: 2208360245465736
Subject: Software engineering
Abstract/Summary:
In recent years the rapid development of the Internet has driven growth in both the economy and technology, and search engines have been among the most active players. They solve the problem of finding the information we want within huge volumes of data. General search engines (also called page search engines), which are based on pages, still play the most important role, but they cannot satisfy every need, especially when the user is looking for information about an object rather than a page. Object-level search engines were created to meet users' need to search for objects rather than pages. An object-level search engine generally consists of three main parts: a web information crawler, a web object extractor, and a search interface. Web object information extraction and integration are the core and the most difficult part of an object-level search engine, and they are the clearest difference between it and a general search engine.

The author worked as an intern at a company and took part in the development of an object-level search engine project, focusing on research into web information extraction algorithms. The main work is as follows:

1. Implemented a configurable web information crawling system based on a multi-threaded framework. It supports multiple data sources specified by a large number of URLs and fetches web pages by deep crawling.
2. Designed a new web information extraction algorithm built on the ideas of Template and Wrapper. The Wrapper fetches web pages from the Internet and saves them as local files under a specified category. The Template extracts web information from the local files to construct objects and saves them in a database.
3. Designed a new information integration algorithm. Extracting from multiple data sources gives rise to problems such as duplication, inconsistency, and conflict. The algorithm parses the original object structure, applies a synonym-judgment algorithm to decide whether object properties are duplicated and whether object values conflict, and then defines a series of rules to integrate the properties and values and save the object in the database.
4. Implemented the improved algorithm and applied it in the Object Search project with excellent results, reaching 90% accuracy.

This dissertation describes in detail the implementation of the object-level search engine with the author's web object extraction method.
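
The integration step in item 3 can be illustrated with a minimal Python sketch. It is not the author's algorithm, which the abstract does not specify. The synonym table, the `canonical` and `integrate` helpers, the sample records, and the rule "keep the most frequent value, with ties going to the earliest source" are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical synonym table: maps variant property names to one canonical name.
# The abstract does not list its synonym rules; these entries are illustrative only.
SYNONYMS = {
    "cost": "price",
    "price": "price",
    "maker": "brand",
    "manufacturer": "brand",
    "brand": "brand",
}


def canonical(name):
    """Normalize a property name and map it to its canonical synonym."""
    key = name.strip().lower()
    return SYNONYMS.get(key, key)


def integrate(records):
    """Merge records describing the same object from several sources.

    Returns (merged, conflicts). Assumed rule: when values disagree, keep the
    most frequent one, breaking ties in favor of the earliest source.
    """
    values = {}  # canonical property -> list of values in source order
    for record in records:
        for name, value in record.items():
            values.setdefault(canonical(name), []).append(value.strip())

    merged, conflicts = {}, {}
    for prop, seen in values.items():
        counts = Counter(seen)
        # most_common orders equal counts by first occurrence.
        merged[prop] = counts.most_common(1)[0][0]
        if len(counts) > 1:
            # Record every distinct value so a later rule or a person can review it.
            conflicts[prop] = sorted(counts)
    return merged, conflicts


# Three hypothetical sources describing the same product.
sources = [
    {"Brand": "Acme", "Price": "199"},
    {"manufacturer": "Acme", "cost": "189"},
    {"maker": "Acme", "price": "199", "color": "black"},
]
merged, conflicts = integrate(sources)
print(merged)     # merged object with canonical property names
print(conflicts)  # properties whose values disagreed across sources
```

In this sketch, duplicated properties such as `Brand`, `manufacturer`, and `maker` collapse into one canonical property, while the disagreeing `price` values are reported separately so that a further rule could resolve them.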
Keywords/Search Tags:Page-Level Searching, Object-Level Searching, General Search Engine, Information Extraction, Web Object Extraction, Template, Wrapper, Visual Information Retrieval