
Research Of Web Information Extraction Technology Oriented To Digital Tourism Website

Posted on: 2013-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: S Wang
GTID: 2248330395455599
Subject: Computer application technology
Abstract/Summary:
The Web has developed rapidly in recent years, and Web information extraction, which uses the Web as an information resource, has become a focus of data mining research. Research on Web information extraction has achieved remarkable results: a variety of methods have been proposed, and the technology is applied in a wide range of areas. In this thesis, Web information extraction technology is applied to digital tourism websites to extract the information that users are interested in.

At present, data described in HTML on the Web are mainly semi-structured: they can be viewed in a browser but cannot be parsed directly by an application. After an in-depth analysis of existing information extraction technology, we propose a Web information extraction method based on the DOM. By analyzing extraction rules based on absolute paths and relative paths, we found that using path features alone did not give satisfactory results. For this reason, we propose an information extraction rule based on feature comparison.

We designed and implemented a DOM-based Web information extraction system. First, the system parses an HTML page into an XML DOM tree. Extraction rules are then generated in a rule-learning phase and stored in a rule base. Finally, the extracted document is obtained and stored in a relational database. The experimental results show that the proposed Web information extraction method gives better extraction results, with higher precision and recall.
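The contrast between a path-only rule and a feature-comparison rule can be illustrated with a minimal Python sketch. The abstract does not specify the thesis's actual rules, so everything here is an assumption for illustration: the sample pages, the function names `extract_by_path` and `extract_by_features`, and the reading of "feature comparison" as matching a node's tag and attributes. The pages are already well-formed XML; real HTML would first have to be cleaned before it could be parsed this way.

```python
import xml.etree.ElementTree as ET

# Hypothetical tourism listings; the same two records appear on both pages.
SPOTS = (
    '<div class="spot"><h2>West Lake</h2><span class="price">80</span></div>'
    '<div class="spot"><h2>Great Wall</h2><span class="price">45</span></div>'
)
# One page carries an extra advertisement block before the records.
PAGE_WITH_AD = '<html><body><div class="ad">Offer</div>' + SPOTS + '</body></html>'
PAGE_PLAIN = '<html><body>' + SPOTS + '</body></html>'


def extract_by_path(page, path):
    """Path-only rule: return the text of nodes at a fixed path from the root."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(path)]


def extract_by_features(page, tag, attrs, child_tag):
    """Feature rule (assumed form): match nodes by tag and attribute values,
    wherever they sit in the tree, then read the requested child."""
    root = ET.fromstring(page)
    results = []
    for node in root.iter(tag):
        if all(node.get(name) == value for name, value in attrs.items()):
            child = node.find(child_tag)
            if child is not None:
                results.append(child.text)
    return results


# A rule learned on the page with the advertisement: "second div under body".
path_rule = "body/div[2]/h2"
path_on_ad_page = extract_by_path(PAGE_WITH_AD, path_rule)
path_on_plain_page = extract_by_path(PAGE_PLAIN, path_rule)

# The feature rule does not depend on the position of the record.
feat_on_ad_page = extract_by_features(PAGE_WITH_AD, "div", {"class": "spot"}, "h2")
feat_on_plain_page = extract_by_features(PAGE_PLAIN, "div", {"class": "spot"}, "h2")

print(path_on_ad_page, path_on_plain_page)
print(feat_on_ad_page, feat_on_plain_page)
```

In this sketch the positional path selects a different record once the advertisement block is absent, while the attribute-based rule returns the same records on both pages. This is one plausible reason path features alone may be insufficient; the thesis itself may define feature comparison differently.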
Keywords/Search Tags: Web Information, Extraction Technology, XML Document Object Model, Extraction Rules