Design And Implementation Of Web Information Extraction Based On DOM

Posted on:2010-10-04

Degree:Master

Type:Thesis

Country:China

Candidate:X G Lian

Full Text:PDF

GTID:2198330338488072

Subject:Software engineering

Abstract/Summary:

With the rapid development of the Internet, the Web has become the main way to access information. It is becoming more difficult for people to find the information they need, because pages are growing explosively in number and are full of irrelevant content. At the same time, as "template + database" web pages have multiplied, a vast store of information resources known as the "hidden Web" has emerged. It is estimated that 80% of Internet content resides in this invisible Web, which search engine crawlers cannot reach. In addition, sites on the Internet are independent of one another and are very difficult to integrate. Under these conditions, conventional search engines are of little help, and Web information extraction technology is becoming necessary.

Based on a summary and analysis of existing information extraction techniques, and targeting "template + database" pages, this dissertation proposes a solution based on the DOM structure, using XPath expressions to locate information points and XSLT to describe extraction rules. On this basis, a Web information extraction system with a higher degree of automation and more widely applicable extraction rules is designed and developed. The system divides the IE task into three stages: learning, information extraction, and database storage. For the learning stage, which is the key point, the dissertation designs and implements a leaf-node access path algorithm, a data acquisition algorithm, a semantic access path algorithm, and an optimization algorithm to generate robust and flexible rules in XSLT. In the information extraction stage, the system uses URL pattern matching and a DOM similarity algorithm to match rules automatically. To balance automation against accuracy, the system also provides an easy-to-use GUI that supports manually guided training. The system is effective for "template + database" pages.
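The idea of locating information points through leaf-node access paths can be illustrated with a minimal sketch. This is not the dissertation's algorithm, whose details the abstract does not give. It assumes a well-formed sample page and uses only the limited XPath subset of Python's standard library. The sample markup and the helper names `leaf_paths` and `generalize` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical "template + database" page: two records rendered by one template.
SAMPLE_PAGE = """<html><body>
<div class="item"><h2>Widget</h2><span>9.99</span></div>
<div class="item"><h2>Gadget</h2><span>19.50</span></div>
</body></html>"""


def leaf_paths(root):
    """Return (path, text) for every leaf element that carries text.

    Paths are positional and relative to the root element,
    e.g. ./body[1]/div[2]/h2[1].
    """
    results = []

    def walk(node, path):
        children = list(node)
        if not children:
            text = (node.text or "").strip()
            if text:
                results.append((path, text))
            return
        seen = {}  # per-tag counter so each child gets a 1-based index
        for child in children:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, ".")
    return results


def generalize(path):
    """Drop positional indexes so one path matches every record (assumed rule form)."""
    return re.sub(r"\[\d+\]", "", path)


root = ET.fromstring(SAMPLE_PAGE)
paths = leaf_paths(root)
for path, text in paths:
    print(path, "->", text)

# Applying the generalized path of the first leaf extracts the same field from all records.
name_rule = generalize(paths[0][0])
names = [element.text for element in root.findall(name_rule)]
print(name_rule, "->", names)
```

A path learned from one labeled example is generalized here by dropping its positional indexes, so the same expression selects the corresponding field in every record. Real pages are rarely well-formed XML, so a practical system would need an HTML parser that repairs markup before building the DOM.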

Keywords/Search Tags:

Web Information Extraction, XML, XSLT, DOM

Related items

1	Design And Implementation Of Web Information Extraction Based On DOM
2	Research And Design, Based On XML And XSLT, Web Information Extraction
3	Research On Web Information Extraction Techniques
4	The Design Of Web Site Builder Based On XML/XSLT
5	Web Information Extraction Based On Principle Part Extraction
6	The Implement Of Web Presentation Layer Based On XSLT
7	Study On Information Extraction And The Index Of Topic Search Engine
8	Online Reporting Delivery Platform Based On XML And XSLT
9	Semi-structured In The XML-based Web Information Extraction
10	Research Of Web Information Extraction Based On XML