Research Of User-defined Requirements' WEB Information Extraction Based On XML

Posted on:2015-03-23

Degree:Master

Type:Thesis

Country:China

Candidate:Y Wang

Full Text:PDF

GTID:2268330428480419

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet in recent years, the Internet has become a huge platform for publishing and sharing information resources. How to efficiently obtain the information users need from massive unstructured or semi-structured data remains an open and actively studied problem, and Web information extraction technology has emerged in response. Although a great deal of research has been done, many shortcomings remain: extraction methods are too specialized, which increases the burden of semantic understanding on users and makes the methods hard to use; it is difficult to obtain user feedback in time during extraction, which degrades the extraction results; and the more complex the extracted content becomes, the less robust the extraction rules are. To address these problems, this thesis studies XML and its related standards as well as existing XML-based extraction methods, and then proposes an XML-based Web information extraction method driven by user-defined requirements. The research covers the following aspects:

(1) Processing the pages to be extracted. Preprocessing filters irrelevant information and code out of the HTML pages and converts them into well-formed XML documents. A visual DOM tree is then generated from the XML document to show the page structure clearly to the user. During this procedure, the DOM tree stores the type of each node and calculates its path expression, in preparation for sample mapping and rule generation.

(2) Implementing the user's extraction requirements. The hierarchical relationship between nodes is described by defining a data schema, which serves as the output structure of the extracted information. The samples marked by the user are used to generate extraction rules. According to the mapping rules, each sample is mapped to the data schema by structure mapping or content mapping, yielding data information and location information.

(3) Completing the extraction rules. Extraction rules consist of one or more templates that match the data schema. Templates are constructed according to whether there is a structure mapping to the root node. If there is, templates are generated recursively for each layer of nodes, using class attributes to match the whole content and relative paths to cover parent-child and ancestor-descendant relationships. If there is not, the common path of the root node's children is taken as the starting point of the template. Because the starting location is unique, the extraction covers only the sample data.

Finally, comparative experiments verify the effectiveness of the extraction method and show that it outperforms the two existing methods it was compared against. The extraction rules remain robust even when the content structure is very complex. A prototype system was also implemented using this method; a demonstration shows that the user can intuitively observe the whole process of information extraction, judge whether the extraction results are accurate, and easily modify them.

Keywords/Search Tags:

WEB Information Extraction, XML, User-defined requirements, Xpath, Extraction Rules

Related items

1	Research And Implementation Of Web Information Extraction Based On XML
2	The Study Of Semi-supervised Web Data Extraction Rule Induction Based On User Interaction
3	The Design And Implementation Of A Multi-user Based Web Information Extraction System
4	Research On Web Information Extraction Techniques
5	Research On Language And Key Techniques For Accurate Information Extraction Rules Towards Complex Web
6	Design And Implementation Of Web Information Extraction Rules
7	Semi-structured Web Information Extraction Technology And Its Application
8	Design And Implementation Of Accurate Web Information Extraction System
9	Data Extraction Technology Research Based On The Location Of Web Information
10	Semi-structured In The Xml-based Web Information Extraction