
Design and Implementation of Web Information Extraction Based on DOM

Posted on: 2010-10-04
Degree: Master
Type: Thesis
Country: China
Candidate: X G Lian
GTID: 2198330338488072
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, the Web has become the main channel for accessing information. Yet it is becoming more difficult for people to find the information they need, because the number of pages is growing explosively and many of them are full of irrelevant content. At the same time, as "template + database" web pages have multiplied, a vast store of information resources known as the "hidden Web" has emerged. An estimated 80% of Internet content lies in this hidden Web, beyond the reach of search engine crawlers. In addition, sites on the Internet are independent of one another, which makes their information very difficult to integrate. Under these conditions, conventional search engines are of little help, and Web information extraction technology is becoming necessary.

Based on a survey and analysis of existing information extraction techniques, and targeting "template + database" pages, this dissertation proposes a solution based on the DOM structure that uses XPath expressions to locate information points and XSLT to describe extraction rules. On this basis, a Web information extraction system with a higher degree of automation and more widely applicable extraction rules is designed and developed.

The system completes the IE task in three stages: learning, information extraction, and database storage. In the learning stage, which is the key part, the dissertation designs and implements a leaf-node access path algorithm, a data acquisition algorithm, a semantic access path algorithm, and an optimization algorithm to generate robust and flexible rules expressed in XSLT. In the information extraction stage, the system uses URL pattern matching and a DOM similarity algorithm to select the matching rules automatically. To balance automation against accuracy, the system also provides an easy-to-use GUI that supports manually guided training. The system produces effective results on "template + database" pages.
Keywords/Search Tags: Web Information Extraction, XML, XSLT, DOM
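
The abstract names a leaf-node access path algorithm and XPath-based location of information points but gives no details. The following is only a minimal sketch of the general idea, not the dissertation's algorithm: it assumes a well-formed XHTML page, uses the limited XPath subset in Python's standard `xml.etree.ElementTree`, and does not reproduce the XSLT rule generation. The sample page and the helper names `leaf_paths` and `generalize` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical "template + database" page: one repeated record template.
PAGE = """<html><body>
<div class="item"><h2>Widget</h2><span class="price">9.99</span></div>
<div class="item"><h2>Gadget</h2><span class="price">19.50</span></div>
</body></html>"""


def leaf_paths(root):
    """Return (path, text) for every leaf element that has non-empty text.

    Each path is a positional XPath relative to the root element,
    e.g. "./body[1]/div[2]/span[1]".
    """
    results = []

    def walk(node, path):
        children = list(node)
        if not children:
            text = (node.text or "").strip()
            if text:
                results.append((path, text))
            return
        counts = {}  # per-tag sibling counter for positional predicates
        for child in children:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, ".")
    return results


def generalize(path):
    """Drop positional predicates so the path matches every record."""
    return re.sub(r"\[\d+\]", "", path)


root = ET.fromstring(PAGE)
paths = leaf_paths(root)

# A positional path locates one information point on the sample page.
first_price_path = paths[1][0]
first_price = root.find(first_price_path).text

# The generalized path collects the same field from all records.
all_prices = [e.text for e in root.findall(generalize(first_price_path))]
```

Dropping every positional index is the crudest possible generalization; a real system would presumably relax only the steps that vary between records, which is where the learning and optimization stages described in the abstract would come in.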