| The Web has become the main carrier of massive amounts of data and information in the Internet age, and people increasingly treat it as their main source of useful information. With the rapid development of electronic commerce, web applications such as vertical search and public opinion and sentiment analysis of social networks rely on Web information extraction (WIE) techniques to obtain large-scale web data. The study of WIE technology therefore has both research significance and commercial value. The core problem of WIE is how to design effective extraction rules that quickly and accurately express the extraction logic of web data records with complex structures, so that data extraction does not require writing hard-coded programs. Existing WIE research has made certain achievements; however, the development of Web representation technology continues to raise new issues for WIE research and technology. 
Generally, existing research on Web information extraction and the related extraction rules still has the following major shortcomings. 1) In the design of extraction rule systems and models, most research lacks in-depth study of the complete extraction process and cannot provide the navigation, accurate data extraction and data integration functions that the whole WIE process requires. 2) Existing research ignores classification models for modern, complex web data records, which limits the applicability of web-scale accurate data extraction techniques. 3) Most Web extraction languages follow a declarative approach; however, they do not adequately meet the requirements of deep web pages with complicated structures. 4) For wrapper invalidation caused by updates to dynamic Web page templates, there are studies on wrapper maintenance, but they seldom integrate these functions into the existing extraction rule system so that the rules themselves can express detection and maintenance. 5) As for the data features used by Web data extraction techniques, most current mainstream approaches adopt the structural and visual features of HTML DOM trees. Although these features handle most routine Web data extraction problems, some pages with complex structures cannot be covered by these two kinds of features alone. 
Moreover, the design and definition of the corresponding Web extraction rule languages do not provide sufficient features to increase their expressive and processing capabilities. 6) Most existing research has not designed or improved the execution process of extraction rule languages for large-scale application scenarios, and lacks analysis and improvement of their execution efficiency. Building on related work on Web information extraction rules, this thesis carries out research in five aspects. 1) We design a whole-process, comprehensive WIE model that describes the navigation logic, data extraction logic and data integration logic of the complete WIE process, and that guides the design of a Web extraction rule language capable of browser navigation, accurate data extraction and data integration. 2) For the system and models of accurate Web extraction rules targeting complex structures, in order to describe WIE more clearly and improve the capability of WIE technologies, we propose the models involved in accurate WIE: a model of web data records with complex structures, a top-down hierarchical extraction process model based on DOM tree structures, a page rules model, and a life cycle model of extraction rule wrappers that covers the design, generation, compilation, execution, detection and maintenance phases of WIE rules. 3) Based on extensive research on basic WIE models, we present a hierarchical framework for a Web information extraction rule language: we build a "data region-data record-data item" hierarchical mapping for each HTML page, comprehensively use the structural, visual and semantic features of the data elements at each level, and combine various extraction predicates to form a powerful XML-based extraction language. 
The functions of the language include data element positioning, restructuring, extraction, fine-grained filtering, extraction anomaly detection, rule maintenance and so on. 4) Following the integrated, multi-functional model and system of the Web extraction rule language, we propose detection and maintenance rules that check whether the current rules are still valid and automatically maintain the wrappers when the template of a web source changes. 5) To further improve the capabilities of our extraction rule language, we add the semantic features of data elements and design semantic-assisted data extraction rules. By merging semantic elements into the existing system and framework of data extraction rules, we can solve data extraction problems that cannot be handled using only the structural and visual features of nodes. In addition, we conduct extensive research on optimizing the execution of the designed rules, and we study and implement a WIE prototype system. Extraction experiments on commercial websites show that the extraction technology and extraction rule language proposed in this thesis have strong expressive and processing capabilities. |