
Cross-site Information Extraction Using Style And Layout Features

Posted on: 2023-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Xie
Full Text: PDF
GTID: 2568307043475134
Subject: Computer application technology

Abstract/Summary:
Web information extraction aims to automatically extract specified data from web pages and serves as the data source for many downstream tasks. How to extract information from a massive number of websites with high quality and efficiency has long been studied. Early methods achieved high accuracy by providing a small number of extraction examples for each website, but the labeling cost grows with the number of websites. Later studies achieved cross-site extraction by exploiting the domain-specific features of website data, but such methods are difficult to transfer to websites in other domains. More recently, deep learning approaches, trained on data annotated from a small group of websites, can extract information from other websites in the same domain; for a new domain, they only require preparing data and retraining. However, these methods use only the information in the page source files, and are therefore prone to errors when the target information relies on visual cues.

To address this problem, taking cross-website operation and freedom from domain restrictions as the basic requirements, this thesis proposes a cross-website extraction approach, WebSLG (Web information extraction using Style and Layout aggregated by Graph Convolution Network). Since the information in a web page is presented as DOM nodes composed of HTML tags, WebSLG models cross-site information extraction as a multi-class classification task over DOM nodes. Its main features are: (1) it automatically controls a browser to open each web page and obtain the rendering result of every DOM node, making full use of the style and layout information produced by rendering to improve extraction; (2) it introduces a graph convolutional network to learn feature representations of each DOM node's context, replacing context features designed by heuristic rules. To support the graph convolutional network, a content-node graph construction algorithm based on block layout is proposed, using the layout information obtained from rendering together with the DOM tree structure: content nodes belonging to the same block, which are highly correlated, are connected by edges in natural line order, while content nodes belonging to different blocks are connected only when the blocks are in an inclusion relation. Compared with fully connected graphs or simple adjacency graphs, this avoids introducing irrelevant, noisy context during node aggregation, and it preserves consistent local layout relationships even though overall layouts differ across sites in cross-site extraction.

Experiments were carried out on a public dataset covering 8 data domains. In the domains where website style and layout are well preserved, the F1 score reaches a best value of 94.49, only 0.9 points lower than the state-of-the-art approach that pre-trains an extended BERT architecture, verifying the effectiveness of the method. In a further experiment on a government-announcement dataset with strong style and layout but weak semantic features, the approach surpasses a rule-based method designed specifically for that field, verifying its advantage in flexibility.
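The block-layout graph construction described above can be sketched in code. The following is a minimal illustration, not the thesis's actual implementation: the `ContentNode` class, the block bounding boxes, and the chaining strategy are all assumptions made for the sketch. Nodes in the same block are chained in natural reading (line) order, and nodes in different blocks are linked only when one block's bounding box encloses the other's.

```python
# Hypothetical sketch of block-layout-based graph construction for DOM
# content nodes; node/block structures are illustrative assumptions.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class ContentNode:
    node_id: int
    block_id: int   # id of the layout block this node belongs to
    bbox: tuple     # (left, top, right, bottom) from browser rendering

def contains(outer, inner):
    """True if bounding box `outer` fully encloses `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def build_edges(nodes, block_bboxes):
    """Build the edge list of the content-node graph.

    Same-block nodes are chained in natural line order (top-to-bottom,
    left-to-right); nodes in different blocks are connected only when
    the blocks are in an inclusion relation.
    """
    edges = set()
    by_block = {}
    for n in nodes:
        by_block.setdefault(n.block_id, []).append(n)
    # 1. Within each block: sort by reading order and chain neighbours.
    for members in by_block.values():
        members.sort(key=lambda n: (n.bbox[1], n.bbox[0]))
        for a, b in zip(members, members[1:]):
            edges.add((a.node_id, b.node_id))
    # 2. Across blocks: connect only when one block encloses the other.
    for ba, bb in combinations(by_block, 2):
        boxa, boxb = block_bboxes[ba], block_bboxes[bb]
        if contains(boxa, boxb) or contains(boxb, boxa):
            for a in by_block[ba]:
                for b in by_block[bb]:
                    edges.add((a.node_id, b.node_id))
    return sorted(edges)
```

The resulting edge list could then be fed, with per-node style features, into a standard graph convolution layer for node aggregation; fully connected or simple adjacency graphs would instead pull in unrelated context from visually distant parts of the page.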
Keywords/Search Tags:Web information extraction, Cross site, Visual cues, Web style and layout, Graph attention network