Research On Deep Web Information Extraction Technology

Posted on:2016-12-17

Degree:Master

Type:Thesis

Country:China

Candidate:C Guo

Full Text:PDF

GTID:2308330464452608

Subject:Software engineering

Abstract/Summary:

In recent years, with the rapid development of Internet technology, the amount of information on the web has grown explosively, providing people with a huge information resource. However, accurately and efficiently mining information from the heterogeneous and diverse web is not an easy task, and it has become a hot topic in the data mining field.

According to its depth in the web, information can be divided into the Surface Web and the Deep Web. Deep Web content is stored in back-end databases and is generated dynamically in response to user queries, which makes it hard to index through static links. The Deep Web contains more abundant, more specialized, and higher-quality information, which gives it greater commercial value.

However, extracting information from web pages is difficult, because HTML was designed to convey visual rather than semantic information, and web information is varied, unstructured, and full of noise. All of this makes web pages easy for users to browse but difficult to use. The purpose of Deep Web information extraction is to extract data in HTML format, obtain its semantic relationships, and save them in a database. Based on an extensive study of web extraction technology, this thesis puts forward two methods for extracting information from Deep Web pages.

(1) A web table extraction method based on structure and ontology

For Deep Web information displayed in tables, this thesis presents an extraction method based on table structure and ontology. The method first locates the tables using heuristic rules, then analyzes the table structure according to the label and the title ontology, and finally extracts and saves the table data on the basis of the obtained characteristics.

(2) A list information extraction method based on visual features and templates

For Deep Web information displayed in div and list elements, this thesis analyzes the page's visual features and presents an extraction method based on visual features and templates. The method first splits the web page into blocks with the VIPS algorithm and then combines this with a tree edit distance algorithm to search for the data region. Finally, the data items are configured manually for data extraction, and the resulting template is saved.
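The abstract does not say which tree edit distance formulation is used to search for the data region. As an illustration only, the sketch below implements a simple top-down variant with unit costs, in which relabeling a node costs 1 and inserting or deleting a child costs the size of its whole subtree. This is more restrictive than the general tree edit distance and is not taken from the thesis. The `Node` class and the `size`, `top_down_distance`, and `similarity` functions are hypothetical names introduced for this example. The idea is that sibling subtrees with high mutual similarity are candidates for repeated data records.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A minimal DOM-like node (hypothetical): a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Top-down tree edit distance with unit costs (an assumed variant).

    Relabeling a node costs 1; inserting or deleting a child costs the
    size of that child's whole subtree.
    """
    root_cost = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    # dp[i][j]: cost of turning the first i children of `a`
    # into the first j children of `b`.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ca, cb = a.children[i - 1], b.children[j - 1]
            dp[i][j] = min(
                dp[i - 1][j] + size(ca),                     # delete subtree ca
                dp[i][j - 1] + size(cb),                     # insert subtree cb
                dp[i - 1][j - 1] + top_down_distance(ca, cb),  # match ca with cb
            )
    return root_cost + dp[m][n]


def similarity(a: Node, b: Node) -> float:
    """Map the distance into (0, 1]; 1.0 means identical tag structure."""
    return 1.0 - top_down_distance(a, b) / (size(a) + size(b))
```

For example, `Node("li", [Node("a"), Node("span")])` and `Node("li", [Node("a"), Node("span"), Node("span")])` differ by one inserted leaf, so their distance is 1 and their similarity is high. Comparing the first of these with `Node("div", [Node("p")])` gives a distance of 3 and a much lower similarity. A data-region search along these lines would compare adjacent sibling subtrees within a block and group those whose similarity exceeds a chosen threshold. The threshold value is a tuning parameter that the abstract does not specify.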

Keywords/Search Tags:

Deep Web, information extraction, table, DOM, visual feature, tree edit distance

Related items

1	Table Information Extraction Based On Web Structure
2	Web Information Extracting Based On Tree Edit Distance
3	Algorithms Based On Visual Similarity Of The Research In Information Extraction And Implementation
4	The Research Of Semi-structured Web Pages Information Extraction
5	Research On Data Extraction Of Deep Web Based On Visual Information And Tree Match
6	Research On Automatic Web Information Extraction Technique
7	Research Of Web Information Extraction Based On Table Structure
8	Research On Technology Of Table Information Extraction In Semi-Structured Texts
9	Workflow Application Of Clustering Tree Edit Distance
10	Research On Deep Web Information Extraction Based On Visual Block And Semantic DOM