
The Study Of Semi-supervised Web Data Extraction Rule Induction Based On User Interaction

Posted on: 2015-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: L Luo
Full Text: PDF
GTID: 2308330485990668
Subject: Computer application technology
Abstract/Summary:
The rapid growth of information technology has made the web the most popular medium for posting and sharing information. Since the emergence of Web 2.0, many kinds of internet applications have entered daily life, which has driven a steady increase in the number of web pages. This mass of web data contains large quantities of valuable information, much of which web-based applications access directly from the internet. Web information extraction has therefore become an important research field.

The core component of a web information extraction tool is the extraction rule (wrapper). The difficulty lies in achieving high extraction precision while preserving a high degree of automation: manually written rules are inefficient to produce, while fully automated rules cannot meet the precision requirement. To address this problem, we propose a semi-supervised technique, based on user interaction, for generating web data extraction rules with high precision. The main work of this thesis is divided into three parts:

(1) Web data extraction rule induction based on small-sample semi-supervised learning. It is difficult to generate reliable rules for data items when a page contains only a single record. We therefore supply a small sample of annotated pages and, taking the context of the data records into account, generate reliable rules from similar data records. Using a step-by-step Apriori-based algorithm that considers the structural, attribute, and content features of DOM trees, we test and merge node features until the best-matching extraction rule is found.

(2) Regular web record extraction rule induction based on user interaction. According to the relationship between the DOM tree structure and the visual structure of a web page, we divide regular data record extraction into three classes: row-based, column-based, and grid-based record extraction. We then design a layered structural extraction rule system, followed by a method for generating rules within this system through user interaction. We also build a UI-based interactive system that helps ordinary users generate their own rules.

(3) Design and implementation of fine-grained extraction rules for web text. Because structural extraction rules yield only coarse data items, we design fine-grained web text extraction rules that provide a second, finer-grained extraction pass over those items. Each rule specifies an extraction range and a pattern over the web text, which makes precise extraction possible.

We have carried out experiments on each part of the research. The results show that the semi-supervised wrappers induced from small samples of pages are robust and achieve high precision and recall. For regular record pages, a structural extraction rule can be obtained with a small amount of user interaction and gives good extraction results. As a supplement to the structural extraction rules, the fine-grained web text extraction rules also achieve satisfactory results.
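The feature test-and-merge idea in part (1) can be illustrated with a minimal sketch. This is not the thesis's algorithm: it assumes each annotated DOM node is reduced to a set of (feature, value) pairs covering only the tag name and attributes, and it uses a simplified level-wise search (smallest feature sets first, in the spirit of Apriori) rather than the full structural, attribute, and content features the abstract mentions. The names `induce_rule` and `to_xpath` and the sample data are hypothetical.

```python
from itertools import combinations


def induce_rule(positives, negatives, max_size=3):
    """Return the smallest feature set shared by every annotated (positive)
    node that matches no unwanted (negative) node, or None if none exists."""
    if not positives:
        return None
    # Only features present in every positive example can be part of a rule.
    common = set.intersection(*positives)
    # Level-wise search: try single features first, then merge into pairs, etc.
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(common), size):
            rule = frozenset(combo)
            # Reject the candidate if it would also select a negative node.
            if not any(rule <= neg for neg in negatives):
                return rule
    return None


def to_xpath(rule):
    """Render a tag/attribute feature set as a simple XPath expression."""
    feats = dict(rule)
    tag = feats.pop("tag", "*")
    preds = "".join(f"[@{name}='{value}']" for name, value in sorted(feats.items()))
    return f"//{tag}{preds}"


# Hypothetical annotations: two price nodes the user marked on sample pages.
positives = [
    {("tag", "span"), ("class", "price"), ("id", "p1")},
    {("tag", "span"), ("class", "price"), ("id", "p2")},
]
# Hypothetical nodes that the rule must not select.
negatives = [
    {("tag", "span"), ("class", "title")},
    {("tag", "div"), ("class", "price")},
]

rule = induce_rule(positives, negatives)
xpath = to_xpath(rule)
print(xpath)
```

In this toy data neither the tag nor the class alone separates the annotated nodes from the others, so the search has to merge the two features before it finds a rule that matches only the annotated nodes.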
Keywords/Search Tags: accurate web information extraction, wrapper induction, XPath rule, semi-supervised learning, web text rules