Research And Design Of Web Information Extraction Based On XML And XSLT

Posted on:2009-09-15

Degree:Master

Type:Thesis

Country:China

Candidate:F Xiao

Full Text:PDF

GTID:2208360245961493

Subject:Software engineering

Abstract/Summary:

With the explosive growth of the World Wide Web, "information overload" has become a serious problem. To help people accurately obtain the specific information they want from the Web, information must be extracted from web pages. A program that performs this task is called a wrapper. The key requirements are that a wrapper can be constructed rapidly and without much human intervention; that it is robust, adapting to changes in web pages; and that it is as general as possible, that is, independent of any particular web site.

Many approaches have been proposed to ease wrapper generation. Almost all of them use proprietary extraction languages. These languages are simple, which makes it hard to express accurate or complex extraction patterns. Although extraction rules can be induced automatically from labeled examples, the resulting rules are not accurate, robust, or general.

We apply standard XML technologies to the web information extraction problem. With standard XSLT, we can exploit the strong and flexible features of the language to construct simple, robust, and general extraction rules. We have developed a platform to ease wrapper construction.

In addition to writing extraction rules manually, we propose novel approaches to automatically induce page templates and record templates, including the extraction rules for each template. A page template can be used to extract the main content of a web page, which is critical to many tasks that operate on page content, such as web information retrieval and web document clustering and classification. A record template can be used to extract list data from a web page. Because the extraction rules are written in XSLT, they can be easily understood and revised.

Finally, we developed a multi-page information extraction framework, since practical applications often need to extract information across multiple pages. With our platform, robust and general wrappers can be developed rapidly.
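
The idea of a record template can be illustrated with a minimal sketch. This is not the thesis's implementation: the thesis expresses its rules in XSLT, while the Python standard library has no XSLT processor, so the sketch below swaps in the limited XPath subset of `xml.etree.ElementTree`. The sample page, the paths, and the names `PAGE`, `RECORD_PATH`, `FIELD_PATHS`, and `extract_records` are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed (XHTML-like) page. Real web pages would
# first have to be cleaned into well-formed XML before rules can be applied.
PAGE = """
<html>
  <body>
    <div id="nav">Home | About</div>
    <ul class="results">
      <li><a href="/t/1">Thesis A</a><span class="year">2007</span></li>
      <li><a href="/t/2">Thesis B</a><span class="year">2008</span></li>
    </ul>
  </body>
</html>
"""

# Hypothetical record template: one path that selects every record node,
# plus one path per field, evaluated relative to the record node.
RECORD_PATH = ".//ul[@class='results']/li"
FIELD_PATHS = {"title": "a", "year": "span[@class='year']"}


def extract_records(page_xml):
    """Apply the record template and return one dict per list record."""
    root = ET.fromstring(page_xml)
    records = []
    for node in root.findall(RECORD_PATH):
        record = {}
        for name, path in FIELD_PATHS.items():
            field = node.find(path)
            # A missing or empty field is recorded as None instead of failing.
            has_text = field is not None and field.text
            record[name] = field.text.strip() if has_text else None
        records.append(record)
    return records


if __name__ == "__main__":
    for rec in extract_records(PAGE):
        print(rec)
```

In an XSLT rule the same structure would typically appear as an `xsl:for-each` over the record path with one `xsl:value-of` per field. Keeping the paths relative to the record node, rather than absolute from the document root, is what lets a rule survive layout changes elsewhere on the page.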

Keywords/Search Tags:

Web information extraction, Extraction rule, XML, XSLT

Related items

1	Research On Web Information Extraction Techniques
2	Web Information Extraction Based On Principle Part Extraction
3	The Study Of Rule Induction For Automatic WEB Data Extraction
4	Design And Implementation Of Web Information Extraction Rules
5	Semi-structured Web Information Extraction Technology And Its Application
6	Design And Implementation Of Web Information Extraction Based On Dom
7	Design And Implementation Of Web Information Extraction Based On DOM
8	Study On Information Extraction And The Index Of Topic Search Engine
9	Semi-structured In The Xml-based Web Information Extraction
10	Research Of Web Information Extraction Based On XML