
Research And Implementation Of Information Extraction Based On Web

Posted on: 2015-12-15
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhang
Full Text: PDF
GTID: 2308330473454015
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet and related technologies, the Internet has become the main tool for sending and receiving information. However, the sheer volume of that information makes it difficult for users to find what is useful. Searching for particular information among a geometrically growing number of web pages is time-consuming, and the results are seldom satisfactory. How to acquire information efficiently has therefore become an open problem in the field of information extraction.

This thesis studies a new information extraction approach that works on data-intensive web pages automatically. The work is divided into the following steps.

First, initialization. All the templates in the test collection are converted into HTML format.

Second, automatic web page noise reduction. Most web pages, especially those of commercial websites such as Dangdang, Amazon and Taobao, carry a lot of useless information, such as navigation bars, advertisements, logos and copyright notices. This thesis uses an improved double sequence alignment to reduce this noise.

Third, automatic template extraction. Most websites now use dynamic web page technology instead of traditional static HTML pages; here ‘dynamic’ refers to the combination of templates and a back-end database used by the system. This thesis aims to study information extraction from such templates. With the help of XML Developer Toolbar, the defective page left after noise reduction is transformed into a standard XML page. These pages are then used as the test collection on which template extraction is tested.

Finally, a collection of data-intensive pages from real websites is tested. The results demonstrate the efficiency of the improved double sequence alignment in noise reduction and of the information extraction system in information extraction.
Keywords/Search Tags:Information Extraction, Double Sequence Alignment, Template
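
The abstract does not describe the improved double sequence alignment in detail, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes that "double sequence alignment" means a pairwise alignment of two token sequences, and it swaps in the standard Needleman-Wunsch global alignment. It also assumes that tokens shared by two pages of the same site (navigation, copyright text) are candidate noise, while text that differs is candidate data. The names `tokenize`, `align` and `varying_text`, and the scoring values, are hypothetical.

```python
from html.parser import HTMLParser


class _Tokenizer(HTMLParser):
    """Flattens a page into a sequence of tag tokens and text tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")  # attributes are ignored in this sketch

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(text)


def tokenize(html):
    parser = _Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two token lists.

    Returns (i, j) index pairs; None on one side marks a gap.
    The scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


def varying_text(page_a, page_b):
    """Text tokens of page_a not aligned to an identical token in page_b."""
    a, b = tokenize(page_a), tokenize(page_b)
    shared = {i for i, j in align(a, b)
              if i is not None and j is not None and a[i] == b[j]}
    return [tok for i, tok in enumerate(a)
            if i not in shared and not tok.startswith("<")]


page_a = "<div><p>Home</p><p>Book A</p><p>Copyright</p></div>"
page_b = "<div><p>Home</p><p>Book B</p><p>Copyright</p></div>"
print(varying_text(page_a, page_b))
```

In this toy example the shared "Home" and "Copyright" tokens are discarded and only the text that differs between the two pages is kept. A real page pair would need attribute handling and a more careful scoring scheme, which is presumably where the thesis's improvements come in.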