With the rapid development of the Internet, the Web has entered an era of information explosion and has become our main platform for publishing and accessing information. A substantial fraction of the Web consists of pages that are generated dynamically from a common template populated with data from back-end databases, such as product description pages on e-commerce sites. The objective of the proposed research is to automatically detect the template behind such pages and extract the embedded data (e.g., product name, price…). Pages of this kind constitute an important part of the so-called "Deep Web", which cannot easily be indexed by general-purpose search engines. Lacking explicit structural and semantic information, the data in deep web pages cannot be used directly, so extracting data from such pages has become an urgent problem. Extraction is accomplished by a wrapper, a procedure for extracting a particular resource's content.

In this paper, we first introduce the basic concepts of Web information extraction, give a short account of the development of the technology, and define the class of web pages our algorithm targets. Second, we describe, compare, and analyze in detail several Web information extraction methods in common use today, pointing out the advantages and disadvantages of each. According to the working principle of the wrapper, current techniques can be divided into the following categories: extraction based on natural language processing, on inductive learning, on HTML structure, on query-language definitions, on visual features, and on ontologies. By degree of automation, Web information extraction systems fall into four categories: manually constructed, supervised, semi-supervised, and unsupervised IE systems. We also discuss future directions for the research and development of Web information extraction.

In this paper we adopt a sequence alignment algorithm to eliminate the effect of the common framework while generating templates and extracting information from deep web pages. The extraction system we present consists of three steps: tokenization, common framework detection, and template extraction.

(1) Tokenization. HTML pages are transformed into string (token) sequences.

(2) Common framework detection. Sequence alignment divides the token sequences into a "common framework" and "data fields". The common framework comprises information that is irrelevant to the core content of the pages and shared by pages from the same source, such as headers, footers, advertisements, navigation bars, and Flash elements; the data fields are everything that remains after the common framework is removed. The token sequences of the data fields are then transformed into tag trees, which form the sample set for template extraction. An appropriate partition level divides each page into common framework and data regions. If we regard HTML pages as strings, a sequence alignment algorithm can compare substrings of different pages and identify the similar ones, which are likely to belong to the common framework.
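To make steps (1) and (2) concrete, the following is a minimal tokenization sketch in Python. The `Tokenizer` class and its token labels (`"TAG"`, `"TEXT"`) are illustrative assumptions, not identifiers from the original system.

```python
from html.parser import HTMLParser

class Tokenizer(HTMLParser):
    """Flatten an HTML page into a sequence of tag and text tokens."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("TAG", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("TAG", f"</{tag}>"))

    def handle_data(self, data):
        text = data.strip()
        if text:                       # skip whitespace-only runs
            self.tokens.append(("TEXT", text))

def tokenize(html: str):
    t = Tokenizer()
    t.feed(html)
    return t.tokens

# tokenize("<div><b>Price:</b> 9.99</div>") yields
# [("TAG", "<div>"), ("TAG", "<b>"), ("TEXT", "Price:"),
#  ("TAG", "</b>"), ("TEXT", "9.99"), ("TAG", "</div>")]
```

The alignment itself can be sketched as a Smith-Waterman-style local alignment over two token sequences. The version below already uses the two-row, no-backtracking memory optimization motivated in the next paragraph; the scoring parameters (match/mismatch/gap) are assumptions for illustration and may differ from those used in the actual system.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Score the best local alignment of token sequences a and b,
    keeping only the previous and current rows of the score matrix."""
    prev = [0] * (len(b) + 1)          # previous row of the score matrix
    best, best_end = 0, (0, 0)         # running local maximum and its cell
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)      # current row, rebuilt each iteration
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                  # local alignment: floor at 0
                          prev[j - 1] + s,    # align a[i-1] with b[j-1]
                          prev[j] + gap,      # gap in b
                          curr[j - 1] + gap)  # gap in a
            if curr[j] > best:
                best, best_end = curr[j], (i, j)
        prev = curr
    return best, best_end
```

Token regions that align with a high score across pages from the same source are treated as common framework; the remaining regions are kept as data fields.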
Dynamic web pages may be larger than 64 KB. To save storage space, only the previous and current rows of the score matrix are stored during its computation, and the local maximum and its origin are recorded along the way; the algorithm thus obtains the best result directly, without backtracking, which reduces computing time.

(3) Template extraction. All matches and mismatches among the samples are found and used to construct the template. This step follows the RoadRunner approach: run on a collection of HTML pages, it iteratively applies a matching algorithm to generate a common wrapper for the pages, and the final wrapper is the template. The algorithm is initialized by taking any one of the pages as the initial wrapper. At each successive step, it matches the wrapper generated so far against a new sample; this is done by solving the mismatches between the wrapper and the sample. Mismatches are very important, since they help to discover essential information about the wrapper: whenever a mismatch is found, we try to solve it by generalizing the wrapper. The algorithm succeeds if a common wrapper can be generated by solving all mismatches encountered during parsing. Essentially two kinds of mismatches can arise during parsing: (a) string mismatches, i.e., mismatches where different strings occur at corresponding positions of the wrapper and the sample; these always mark data fields to be extracted; and (b) tag mismatches, i.e., mismatches between different tags, or between a tag and a string.

From the hierarchical relationships of the resulting tree we can conveniently determine the data model of the web pages and provide valuable input for semantic analysis. Because the data model of a web page is not a simple flat table, its data values cannot be extracted directly into database tables. We therefore adopt the XML format for storage: the template is used to extract the data and combine it, following the data model, into structured XML. We focus on XML because it is a standard format for online data exchange and can easily be transformed into a relational database, so we define the shared data structure in an XML Schema.
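A heavily simplified sketch of the RoadRunner-style matching in step (3) follows. It generalizes a wrapper only at string mismatches, replacing them with a `#PCDATA` data-field placeholder; handling tag mismatches (discovering optional and repeated regions) requires considerably more machinery and is omitted here. The token format matches the tokenizer sketch above, and `PCDATA` and the function names are illustrative, not from the original system.

```python
PCDATA = ("FIELD", "#PCDATA")          # placeholder for a data field

def match(wrapper, sample):
    """Generalize `wrapper` against `sample` by solving string mismatches."""
    if len(wrapper) != len(sample):
        # Real RoadRunner resolves this via optional/iterator discovery.
        raise ValueError("tag mismatch: sequences differ in length")
    out = []
    for w, s in zip(wrapper, sample):
        if w == s:
            out.append(w)              # exact match: keep as-is
        elif w[0] in ("TEXT", "FIELD") and s[0] == "TEXT":
            out.append(PCDATA)         # string mismatch marks a data field
        else:
            raise ValueError(f"tag mismatch: {w} vs {s}")
    return out

def build_template(pages):
    wrapper = pages[0]                 # any page can seed the wrapper
    for sample in pages[1:]:
        wrapper = match(wrapper, sample)
    return wrapper
```

For storage, the extracted records can be serialized as XML along the lines of the following sketch; the record and field names ("record", "name", "price") are example labels, not ones prescribed by the paper's schema.

```python
import xml.etree.ElementTree as ET

def records_to_xml(records):
    """Serialize a list of field-to-value dicts as an XML document."""
    root = ET.Element("records")
    for rec in records:
        node = ET.SubElement(root, "record")
        for field, value in rec.items():
            ET.SubElement(node, field).text = value
    return ET.tostring(root, encoding="unicode")

# records_to_xml([{"name": "Pen", "price": "9.99"}]) yields
# '<records><record><name>Pen</name><price>9.99</price></record></records>'
```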
Finally, on data-intensive web pages from real-world websites, we evaluated the effect of the alignment parameters on the extraction results, and the effect of the common framework detection phase on decreasing the data volume and increasing extraction accuracy. The experimental results convincingly demonstrate the validity of the approach. Because time and personal knowledge are limited, while computer technology develops rapidly, some inadequacies remain to be improved. The next stage of the research will focus on the following aspects:

(1) As the design of HTML pages becomes more complex, it becomes significant to filter irrelevant information effectively, improve system stability, and ensure extraction accuracy. If the location of page elements, such as visual information, style settings, and concrete positions, can be fully exploited, the pre-processing capability can be improved.

(2) The algorithm is based on sequence alignment; if the alignment algorithm can be optimized, it may help improve the detection accuracy of the common framework.

(3) As shown in the experimental part, our extraction algorithm needs further improvement to handle web pages with nested records.

(4) In addition, how to combine data from multiple sources and analyze data selection on a single page is a direction for our further study.