Research on XML-Based Web Information Extraction for User-Defined Requirements

Posted on: 2015-03-23 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Wang | Full Text: PDF
GTID: 2268330428480419 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet in recent years, it has become a huge platform for publishing and sharing information resources. How to efficiently obtain the information users need from massive unstructured or semi-structured data remains an open and actively studied problem, and Web information extraction technology has emerged in response. A great deal of research has been done, but many shortcomings remain: extraction methods are too specialized, which increases the burden of semantic understanding on users and makes the methods hard to use; it is difficult to obtain user feedback in time during extraction, which affects the extraction results; and the more complex the extracted content becomes, the less robust the extraction rules are. To address these problems, this thesis studies XML and its related standards as well as existing XML-based extraction methods, and then proposes an XML-based Web information extraction method driven by user-defined requirements. The research covers the following aspects:

(1) Processing the pages to be extracted. Preprocessing filters irrelevant information and code out of the HTML pages and converts them into well-formed XML documents. A visual DOM tree is then generated from each XML document to show the page structure clearly to the user. During this step, the DOM tree stores the type of each node and computes its path expression, in preparation for sample mapping and rule generation.

(2) Capturing the user's extraction requirements. The hierarchical relationships between nodes are described by defining a data schema, which also serves as the output structure of the extracted information. Samples marked by the user are used to generate extraction rules: according to the mapping rules, each sample is mapped to the data schema as either a structure mapping or a content mapping, yielding both data information and location information.

(3) Completing the extraction rules. Extraction rules consist of one or more templates that match the data schema. How a template is constructed depends on whether a structure mapping to the root node exists. If it does, templates are generated recursively for the nodes of each layer, using class attributes to match the whole content and relative paths to cover parent-child and ancestor-descendant relationships. If it does not, the common path of the root node's children is taken as the starting point of the template; because this starting location is unique, the extraction includes only the sample data.

Finally, comparative experiments verify the effectiveness of the extraction method and show that it performs better than two existing methods. The extraction rules remain robust even when the structure of the content is very complex. A prototype system has also been implemented with this method; its demonstration shows that users can observe the whole extraction process intuitively, judge whether the extraction results are accurate, and modify them easily.
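
As a rough illustration of the ideas in the abstract, and not the thesis's actual implementation, the following Python sketch computes a path expression for every node of an already well-formed XML page and then applies a template that matches record nodes by class attribute and locates fields by relative path. The sample page, the `TEMPLATE` layout, and the function names `path_expressions` and `extract` are all assumptions made for this example; only the standard-library `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already cleaned and converted to well-formed XML.
PAGE = """
<html><body>
  <div class="item"><h2 class="title">Book A</h2><span class="price">10</span></div>
  <div class="item"><h2 class="title">Book B</h2><span class="price">12</span></div>
  <div class="ad"><span class="price">99</span></div>
</body></html>
""".strip()


def path_expressions(root):
    """Map every element to an absolute, position-indexed path expression."""
    paths = {root: "/" + root.tag}

    def walk(node):
        seen = {}  # per-parent counter of child tags, used for the [n] index
        for child in node:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            paths[child] = f"{paths[node]}/{child.tag}[{seen[child.tag]}]"
            walk(child)

    walk(root)
    return paths


# Hypothetical template: the record node is matched by its class attribute,
# and each field is located by a path relative to that record node.
TEMPLATE = {
    "record": ".//div[@class='item']",
    "fields": {
        "title": "./h2[@class='title']",
        "price": "./span[@class='price']",
    },
}


def extract(root, template):
    """Return one dict per matched record node, keyed by field name."""
    records = []
    for node in root.findall(template["record"]):
        record = {}
        for name, rel_path in template["fields"].items():
            hit = node.find(rel_path)
            record[name] = hit.text if hit is not None else None
        records.append(record)
    return records


root = ET.fromstring(PAGE)
paths = path_expressions(root)
records = extract(root, TEMPLATE)
```

Here the `div` with class `ad` is skipped because the record pattern matches on the class attribute, which loosely mirrors how a class-attribute match can keep unrelated content out of the extracted data.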
Keywords/Search Tags: Web Information Extraction, XML, User-defined Requirements, XPath, Extraction Rules