
Research and Design of Web Information Extraction Based on XML and XSLT

Posted on: 2009-09-15
Degree: Master
Type: Thesis
Country: China
Candidate: F Xiao
Full Text: PDF
GTID: 2208360245961493
Subject: Software engineering
Abstract/Summary:
With the explosion of the World Wide Web, "information overload" has become a serious problem. To help people accurately find the information they want on the Web, information must be extracted from web pages. A program that performs this task is called a wrapper. The key requirements are that a wrapper can be constructed rapidly with little human intervention, that it is robust to changes in web pages, and that it is as general as possible, that is, independent of any particular web site.

Many approaches have been proposed to ease wrapper generation. Almost all of them use proprietary extraction languages. These languages are simple, which makes accurate or complex extraction patterns hard to express. Although extraction rules can be induced automatically from labeled examples, the resulting rules are not accurate, robust, or general.

We apply standard XML technologies to the web information extraction problem. Using standard XSLT, we exploit the language's powerful and flexible features to construct simple, robust, and general extraction rules. We have developed a platform to ease wrapper construction.

In addition to manually written extraction rules, we propose novel approaches that automatically induce page templates and record templates, including the extraction rules for each template. A page template can be used to extract the main content of a web page, which is critical to many tasks that operate on page content, such as web information retrieval and web document clustering and classification. A record template can be used to extract list data from a web page. Because the extraction rules are written in XSLT, they can be easily understood and revised.

Finally, we developed a multi-page information extraction framework, since practical applications often need to extract information across multiple pages. With our platform, robust and general wrappers can be developed rapidly.
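To make the idea of a record template concrete, here is a minimal sketch in Python. It is not the thesis's platform and it does not run XSLT: it uses the limited XPath subset in the standard library's `xml.etree.ElementTree` as a stand-in for an XSLT rule, and it assumes the page is already well-formed XHTML. The sample page, the `RECORD_RULE` layout, and the `extract_records` function are all hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed XHTML page. Real pages usually need to be
# tidied into XML before path-based rules can be applied.
PAGE = """
<html>
  <body>
    <div id="nav"><a href="/">Home</a></div>
    <ul class="results">
      <li><a href="/item/1">First item</a><span class="price">10.00</span></li>
      <li><a href="/item/2">Second item</a><span class="price">12.50</span></li>
    </ul>
  </body>
</html>
"""

# Hypothetical record template: one path that selects each record node, plus
# one (relative path, attribute) pair per field. An attribute of None means
# "take the element text". In XSLT the same rule would typically pair
# xsl:for-each with xsl:value-of.
RECORD_RULE = {
    "record": ".//ul[@class='results']/li",
    "fields": {
        "title": ("a", None),
        "url": ("a", "href"),
        "price": ("span[@class='price']", None),
    },
}


def extract_records(page_xml, rule):
    """Apply a declarative record rule to a page and return a list of dicts."""
    root = ET.fromstring(page_xml.strip())
    records = []
    for node in root.findall(rule["record"]):
        record = {}
        for name, (path, attribute) in rule["fields"].items():
            target = node.find(path)
            if target is None:
                # Field missing from this record: keep the key, mark it empty.
                record[name] = None
            elif attribute is None:
                record[name] = (target.text or "").strip()
            else:
                record[name] = target.get(attribute)
        records.append(record)
    return records


if __name__ == "__main__":
    for row in extract_records(PAGE, RECORD_RULE):
        print(row)
```

Because the rule is plain data, separate from the code that applies it, it can be read and revised by hand, which is the same property the abstract claims for rules written in XSLT.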
Keywords/Search Tags:Web information extraction, Extraction rule, XML, XSLT