Research On Automatic Summarization From Web Text

Posted on:2013-08-13

Degree:Master

Type:Thesis

Country:China

Candidate:X L Duan

GTID:2248330371497518

Subject:Information management and e-government

Abstract/Summary:

With the rapid development of Internet technology, Web information has become one of the most important information resources, but this growth has brought an "information explosion". Web pages contain a lot of redundant information, such as advertisements, navigation bars, copyright notices, and other content unrelated to the page's subject. How to find the desired information quickly and accurately is an urgent problem. A summary is a condensed and refined version of a text: by reading it, readers can decide whether the full text is worth reading, which saves valuable time and energy.

Web content extraction is the foundation of Web information processing tasks such as information retrieval and text mining. After analyzing and summarizing the related research, this paper analyzes the characteristics of topic pages, including the features of their topical text and of their structure, and presents a text extraction method for topic pages that combines Web page text features with HTML tag characteristics. First, text content blocks are obtained from the DOM tree parsed from the Web page; then the characteristics of the noise in those blocks are analyzed so that the noise can be removed. The method requires no prior training on samples, adapts well to different pages, and explicitly handles noise information.

On the basis of this processing, and with the completeness of topic extraction in mind, this paper combines an understanding-based method with a structure-based method. First, the text is divided into sub-topics. Then a sentence relation graph is constructed for each sub-topic, and the sentences in the graph are ranked using the PageRank algorithm. Finally, the topic sentence of each sub-topic is extracted according to an extraction rule. This approach ensures that the extracted sentences have the widest semantic coverage of each topic in the text. We design and implement an automatic Web summary extraction system.
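
The sentence-ranking step described above can be illustrated with a minimal sketch. The abstract does not say how edges in the sentence relation graph are weighted, so this example assumes bag-of-words cosine similarity as the edge weight. The function names (`tokenize`, `cosine_similarity`, `rank_sentences`), the damping factor of 0.85, and the fixed iteration count are all illustrative assumptions rather than details taken from the thesis.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Return sentence indices ordered by weighted PageRank score, best first.

    Assumption: the edge weight between two sentences is the cosine
    similarity of their word bags; the thesis may use a different relation.
    """
    n = len(sentences)
    if n == 0:
        return []
    bags = [Counter(tokenize(s)) for s in sentences]
    weights = [
        [cosine_similarity(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge j -> i.
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    sub_topic = [
        "web page content extraction removes noise",
        "content extraction uses the dom tree of the web page",
        "the weather is sunny today",
        "noise such as advertising is removed from the web page content",
    ]
    for index in rank_sentences(sub_topic):
        print(index, sub_topic[index])
```

Under these assumptions, a sentence that shares vocabulary with many other sentences in its sub-topic ranks high, and an off-topic sentence ranks low. Applying the function once per sub-topic and keeping the top-ranked sentence would correspond to the simplest possible extraction rule.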
The experiments use a real corpus collected from the Internet, and the results are compared with those of existing similar methods. Using precision and recall as evaluation metrics, we first analyze the output of this paper's Web content extraction method and then evaluate the quality of the extracted summaries. The experiments show that the algorithm achieves high accuracy and good topic coverage.
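
As a reminder of how the two evaluation metrics are defined, the following sketch computes precision and recall for a set of extracted items against a reference set. The function name `precision_recall` and the choice to compare items by exact match are assumptions made for illustration; the abstract does not specify how extracted content is matched against the reference.

```python
def precision_recall(extracted, reference):
    """Compute precision and recall of extracted items against a reference.

    Assumption: items (for example sentences or text blocks) are compared
    by exact equality after being placed in sets.
    """
    extracted_set = set(extracted)
    reference_set = set(reference)
    true_positives = len(extracted_set & reference_set)
    # Precision: fraction of extracted items that are correct.
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    # Recall: fraction of reference items that were extracted.
    recall = true_positives / len(reference_set) if reference_set else 0.0
    return precision, recall


if __name__ == "__main__":
    p, r = precision_recall(["s1", "s2", "s3"], ["s2", "s3", "s4", "s5"])
    print(f"precision={p:.2f} recall={r:.2f}")
```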

Keywords/Search Tags:

Web Content Extraction, Topic Analysis, PageRank Algorithm, Summary Extraction

Related items

1	Topic Search Engine Key Technology Research
2	Research On User Influence Of Weibo Specific Topic Domain Based On Interaction Relationship
3	Webpage Content Extraction Techniques For Specific Topic
4	Research On Key Techniques Of Topic-Oriented Blog Resource Mining
5	PageRank Algorithm Based On Chinese Research And Application Of Vertical Search Engine
6	Research On Keyword Extraction Based On Latent Topic Model And New Word Discovery
7	Web Page Sorting Algorithms Based On The Analysis Of The Linking Structure
8	A Number Of Studies Based On PageRank Sort Algorithm Improvement
9	A Study Of Topic Modeling And Content Analysis For Soccer Video
10	Searching Topic-specific Authoritative Information Sources On The Web With Content And Link Analysis