Research On Deep Web Information Extraction Technology

Posted on: 2016-12-17
Degree: Master
Type: Thesis
Country: China
Candidate: C Guo
Full Text: PDF
GTID: 2308330464452608
Subject: Software engineering
Abstract/Summary:
In recent years, with the rapid development of Internet technology, the amount of information on the web has grown explosively, providing people with a huge information resource. However, mining information accurately and efficiently from the heterogeneous and diverse web is not an easy task, and it has become a hot topic in the data mining field.

According to its depth in the web, information can be divided into Surface Web and Deep Web. Deep Web content is stored in back-end databases and is generated dynamically in response to user queries, which makes it hard to index through static links. The Deep Web contains more abundant, more specialized, and higher-quality information, which gives it greater commercial value.

Extracting information from web pages is difficult, however, because HTML was designed to convey visual rather than semantic information, and web information is varied, unstructured, and full of noise. As a result, web pages are easy for users to browse but difficult to process automatically.

The purpose of Deep Web information extraction is to extract data in HTML format, recover its semantic relationships, and save it in a database. Based on a study of web extraction technology, this paper puts forward two methods for extracting information from Deep Web pages.

(1) A web table extraction method based on structure and ontology
For Deep Web information displayed in tables, this paper presents an extraction method based on table structure and ontology. The method first locates the tables using heuristic rules, then analyzes the table structure according to the labels and the title ontology, and finally extracts and saves the table data on the basis of the characteristics obtained.

(2) A list information extraction method based on visual features and templates
For Deep Web information displayed in div and list elements, this paper analyzes the page's visual features and presents an extraction method based on visual features and templates. The method first splits the web page into blocks with the VIPS algorithm and then uses a tree edit distance algorithm to search for the data region. Finally, the data items are configured manually for data extraction, and the resulting template is saved.
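The data-region search in method (2) can be illustrated with a small Python sketch. The abstract does not say which tree edit distance variant or similarity threshold is used, so this example makes its own assumptions: it uses a simple top-down (Selkow-style) edit distance rather than the general Zhang-Shasha algorithm, represents each DOM node as a hypothetical (tag, children) tuple, and treats the longest run of adjacent, structurally similar sibling blocks as the data region. The names tree_distance, normalized_distance, find_data_region and the 0.3 threshold are illustrative only and do not come from the paper.

```python
from functools import lru_cache

# Assumed representation: a node is (tag, children), children is a tuple of nodes.

def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    _, children = node
    return 1 + sum(size(child) for child in children)

@lru_cache(maxsize=None)
def tree_distance(a, b):
    """Top-down tree edit distance: relabel costs 1, and inserting or
    deleting a child costs the size of that child's whole subtree."""
    (tag_a, kids_a), (tag_b, kids_b) = a, b
    relabel = 0 if tag_a == tag_b else 1
    m, n = len(kids_a), len(kids_b)
    # Sequence edit distance over the two child lists.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(kids_a[i - 1]),        # delete subtree
                dp[i][j - 1] + size(kids_b[j - 1]),        # insert subtree
                dp[i - 1][j - 1]
                + tree_distance(kids_a[i - 1], kids_b[j - 1]),  # match subtrees
            )
    return relabel + dp[m][n]

def normalized_distance(a, b):
    """Distance scaled by the combined tree size, so it lies in [0, 1)."""
    return tree_distance(a, b) / (size(a) + size(b))

def find_data_region(children, threshold=0.3):
    """Return (start, end) of the longest run of adjacent siblings whose
    pairwise normalized distance stays at or below `threshold`."""
    best = (0, 0)
    start = 0
    for i in range(1, len(children) + 1):
        similar = (
            i < len(children)
            and normalized_distance(children[i - 1], children[i]) <= threshold
        )
        if not similar:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = i
    return best

# Toy page: a navigation block, three list records, and a footer.
nav = ("div", (("ul", (("li", ()), ("li", ()))),))
item1 = ("li", (("a", ()), ("span", ())))
item2 = ("li", (("a", ()), ("span", ())))
item3 = ("li", (("a", ()), ("span", ()), ("img", ())))
footer = ("p", ())
page = (nav, item1, item2, item3, footer)

print(find_data_region(page))  # (1, 4): the three list records
```

In this sketch the blocks would come from a page segmentation step such as VIPS; here they are written by hand. The third record has an extra img child, yet it still falls inside the region because only one insertion separates it from its neighbour.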
Keywords/Search Tags: Deep Web, information extraction, table, DOM, visual feature, tree edit distance