| With the rapid development of the Internet, large amounts of information are presented in human-readable web pages. Many web pages display lists of structurally similar records, such as the merchandise listings on an online shopping page. To convert these records into regularly formatted database entries, many information extraction algorithms have been proposed. These algorithms extract information mainly by analyzing the structure of the page source code or by using the visual information of the rendered page. At present, however, most of them consider source code structure and visual information separately, and the algorithms themselves have poor self-learning ability. We therefore study the information extraction problem using both the visual information and the hierarchical structure of web pages, and we use a feedback learning mechanism to improve extraction quality and learning ability. To combine visual information with hierarchical structure, we use the rendering tree that is generated while the web page is rendered. The algorithm uses visual information to identify the data region, computes the similarity of records from the hierarchical structure of the web page, and extracts records by clustering. It then aligns data items using a weighted tree matching algorithm. Experimental results show that combining visual information with the hierarchical structure of the web page improves extraction quality. To improve self-learning and the handling of complex web page structures, our algorithm combines a feedback learning framework with the information extraction algorithm. This approach improves extraction quality by using users' feedback and by performing multi-model learning on users' annotations. Experimental results show that the proposed feedback-learning-based algorithm handles complex web page structures better and achieves better extraction results than existing algorithms. To apply the algorithm in industrial settings and make it easier to use, we design and implement an interface for the information extraction system. We describe the functional design and implementation of each module of the system in detail. Finally, we describe how the system improves on existing information extraction projects. |
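
The record-similarity step summarized above can be illustrated with a minimal sketch. It uses the classic, unweighted Simple Tree Matching recurrence as a stand-in, since the abstract does not specify the weighting scheme; the `Node` class, the function names, and the normalization by tree size are assumptions made for illustration, not the authors' implementation.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one node of a page's tag tree."""
    tag: str
    children: list[Node] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Count the nodes in the subtree rooted at `node`."""
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the largest order-preserving match between two trees.

    Unweighted Simple Tree Matching: roots must carry the same tag, and
    the child sequences are aligned by dynamic programming.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            matched = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            table[i][j] = max(
                table[i - 1][j],               # skip a child of a
                table[i][j - 1],               # skip a child of b
                table[i - 1][j - 1] + matched  # pair the two children
            )
    return table[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Normalize the match size to [0, 1] (assumed normalization)."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


# Two candidate records that differ by one missing child element.
rec1 = Node("li", [Node("img"), Node("a"), Node("span")])
rec2 = Node("li", [Node("img"), Node("span")])
score = similarity(rec1, rec2)
```

Records whose score exceeds a chosen threshold could then be grouped into one cluster. A weighted variant would replace the constant `1` contributed by each matched node with a per-node weight.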