Research On Automatic Summarization From Web Text

Posted on: 2013-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: X L Duan
GTID: 2248330371497518
Subject: Information management and e-government
Abstract/Summary:
With the rapid development of Internet technology, Web information has become one of the most important information resources, but it has brought with it the problem of "information explosion". Web pages contain a great deal of redundant information, such as advertising, navigation tools, copyright notices, and other content unrelated to the page's subject. How to find the desired information quickly and accurately is an urgent problem. A summary is a condensed and refined version of a text; by reading it, readers can decide whether the full text is worth reading, saving valuable time and energy.
Web content extraction is the foundation of Web information processing tasks such as information retrieval and text mining. After reviewing the related research, this thesis analyzes the characteristics of topic pages, including the features of their topical text and their structure, and presents a text extraction method for topic pages that combines the text features of the page with the characteristics of its HTML tags. The method first obtains the text content blocks from the DOM tree parsed from the web page, then analyzes the characteristics of the noise in those blocks in order to remove it. It requires no prior training on samples, adapts well to different pages, and explicitly handles noise information.
Building on this processing, and to make the extracted summary cover the subject completely, the thesis adopts an approach that combines an understanding-based method with a structure-based method. First, the text is divided into sub-topics. Then a sentence relation graph is constructed for each sub-topic, and the sentences in the graph are ranked using the PageRank algorithm. Finally, a topic sentence is extracted for each sub-topic according to an extraction rule. This approach aims to ensure that the sentence extracted for each sub-topic has the widest semantic coverage of that sub-topic. An automatic web summary extraction system was designed and implemented.
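The sentence-ranking step can be illustrated with a minimal sketch. The abstract does not say how edges in the sentence relation graph are weighted or what the extraction rule is, so this example assumes bag-of-words cosine similarity as the edge weight and simply picks the highest-scoring sentence. The function names (`tokenize`, `cosine_similarity`, `rank_sentences`, `top_sentence`) are hypothetical and are not taken from the thesis.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank over a similarity graph.

    Assumption: edge weights are cosine similarities; the thesis may
    define sentence relations differently.
    """
    n = len(sentences)
    if n == 0:
        return []
    bags = [Counter(tokenize(s)) for s in sentences]
    # Symmetric weight matrix with no self-loops.
    weights = [
        [0.0 if i == j else cosine_similarity(bags[i], bags[j]) for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Sentences with no outgoing weight pass nothing on.
            incoming = sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def top_sentence(sentences):
    """Return the highest-scoring sentence (a stand-in extraction rule)."""
    scores = rank_sentences(sentences)
    best = max(range(len(sentences)), key=lambda i: scores[i])
    return sentences[best]
```

In this sketch, `top_sentence` would be called once per sub-topic, on the list of sentences assigned to that sub-topic. A sentence that shares no words with the others receives only the baseline score `(1 - damping) / n`, so it is never chosen over a connected sentence.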
The experiments use a real corpus collected from the Internet, and the results are compared with those of existing similar methods. Using precision and recall as evaluation metrics, the experiments first analyze the results of the proposed web content extraction method and then evaluate the quality of the extracted summaries. The experiments show that the algorithm achieves high accuracy and good topic coverage.
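For reference, precision and recall can be computed as in the short sketch below. The abstract does not state the unit of comparison, so this example assumes that the extracted items and the reference items (for example, sentences or text blocks) are compared as sets. The function name `precision_recall` and the sample identifiers are hypothetical.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) for extracted items against a reference.

    Assumption: items are hashable units (e.g. sentence IDs) compared
    by exact match; the thesis may use a different matching criterion.
    """
    extracted_set = set(extracted)
    reference_set = set(reference)
    true_positives = len(extracted_set & reference_set)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    # Recall: share of reference items that were extracted.
    recall = true_positives / len(reference_set) if reference_set else 0.0
    return precision, recall


# Illustrative data only; these are not results from the thesis.
precision, recall = precision_recall(
    extracted=["s1", "s2", "s3", "s4"],
    reference=["s1", "s2", "s5"],
)
```

Here two of the four extracted items appear in the reference, and two of the three reference items were extracted.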
Keywords/Search Tags: Web Content Extraction, Topic Analysis, PageRank Algorithm, Summary Extraction