Research And Implementation Of Information Extraction Based On Web

Posted on:2015-12-15

Degree:Master

Type:Thesis

Country:China

Candidate:C Zhang

GTID:2308330473454015

Subject:Software engineering

Abstract/Summary:

With the rapid development of the Internet and its related technologies, the Internet has become the main tool for sending and receiving information. However, the sheer volume of that information makes it difficult for users to find what is useful. Searching for particular information among a geometrically growing number of web pages is time-consuming, and the results are seldom satisfactory. How to acquire information efficiently has therefore become an open question in the field of information extraction. This thesis studies a new information extraction approach and tests it automatically on data-intensive web pages. The work is organized into the following steps.

First, initialization: all templates in the test collection are converted into HTML format.

Second, automatic web page noise reduction: most web pages, especially those of commercial websites such as dangdang, amazon and taobao, carry a lot of useless information such as navigation bars, advertisements, logos and copyright notices. This thesis uses an improved double sequence alignment to reduce this noise.

Third, automatic template extraction: most websites now use dynamic web page technology instead of traditional static HTML pages, where 'dynamic' refers to the combination of templates and a back-end database used by the system. This thesis therefore studies information extraction from templates. With the help of XML Developer Toolbar, the defective page left after noise reduction is transformed into a standard XML page. These pages then form the test collection on which template extraction is tested.

Finally, a collection of data-intensive pages from real websites is tested. The results show the efficiency of the improved double sequence alignment in noise reduction and of the information extraction system in information extraction.
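The idea behind alignment-based noise reduction can be illustrated with a minimal Python sketch. It is not the improved double sequence alignment used in the thesis, whose details the abstract does not give. It assumes a plain longest-common-subsequence alignment of two pages generated from the same template, and it assumes that tokens shared by both pages are template or noise while text that differs is data. The names `tokenize`, `align` and `varying_text` are hypothetical.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Flatten a page into a sequence of tag tokens and text tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(text)


def tokenize(html):
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


def align(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover the aligned positions.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def varying_text(page_a, page_b):
    """Text tokens of page_a that have no aligned counterpart in page_b."""
    a, b = tokenize(page_a), tokenize(page_b)
    matched = {i for i, _ in align(a, b)}
    # Assumption: unmatched text is data; matched tokens are template or noise.
    return [tok for i, tok in enumerate(a)
            if i not in matched and not tok.startswith("<")]
```

For two pages that differ only in one product name, for example `<div><p>Home</p><p>Book A</p><p>Copyright</p></div>` and the same markup with `Book B`, `varying_text` keeps `Book A` and drops the shared `Home` and `Copyright` text. A real system would need more than two sample pages and a way to handle data that happens to be identical across pages.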

Keywords/Search Tags:

Information Extraction, Double Sequence Alignment, Template

Related items

1	The Research Of Dynamic Web Pages Information Extraction Algorithm Based On Sequence Alignment
2	Design And Implementation Of Web Information Extraction Rules
3	Research On Sequence Alignment Algorithms Based On Network Protocol Reverse
4	Video Sequence Alignment Based On The Combination Of Movement Information And Background Information
5	The Research And Implementation Of Biological Sequence Alignment
6	Automatically Get To Build The Study Of Biological Information Platform And Sequence Alignment Algorithm Based On Information
7	Study On Chinese Sequence Alignment Based On Word2VEC And CRF
8	Biological Sequence Alignment Problem
9	Research Of Improvement And Parallelization For Sequence Assembly And Multiple Sequence Alignment
10	Research On WEB Information Extraction Method Based On Label