Research In Sub-Topic Based Multi-Document Summarization

Posted on: 2009-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: D Zhou
GTID: 2178360245969817
Subject: Signal and Information Processing
Abstract/Summary:
Multi-document summarization is an important branch of natural language processing. It aims to extract the important information from a group of documents that share a similar topic and to generate a concise summary with good coverage, which helps readers obtain and use information quickly.

In this thesis, we study sub-topic based multi-document summarization, which generates a summary by first dividing the whole document set into several sub-topics, then selecting the most important information from each sub-topic and organizing it in a logical order. The approach has two steps: sub-topic clustering and sentence selection.

Sub-topics are groups of information fragments that share a similar meaning. These fragments are spread across the whole document set, and each group indicates one aspect of the main topic. To discover the sub-topics, we cluster the document set into several groups according to their semantic similarity. We examined three methods for calculating sentence similarity and propose a method based on N-gram semantic similarity. We then performed hierarchical clustering on the sentences, using the computed sentence similarities, to generate the sub-topics.

For sentence selection, we studied the sub-tasks of topic centroid extraction and sentence selection strategy.

We propose to use the document centroid and the cluster centroids to represent the global topic and the sub-topics respectively. A centroid is a group of words that are able to indicate the topic. We examined two methods for extracting the global topic centroid and each sub-topic centroid, the Count-IDF method and a hypothesis testing method, and analyzed their differences.

Sentence selection strategy is the key step of summary generation. Sub-topics are first ranked according to their importance, which determines the order of sentence selection.
We propose to use the sum of weighted words to represent the information of the summary, so that sentence selection becomes a process of maximizing the information of the summary. Sentences are selected according to their importance and the information already covered by previously selected sentences. Our strategy proved useful for information coverage.

We generated summaries for five different document sets, each about a single topic, and evaluated summary quality with several different evaluation methods. The results show that our methods for sub-topic clustering and sentence selection help to improve summary quality.
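
The sentence selection strategy described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the tokenized-sentence representation, the word weights, and the round-robin visiting order over ranked sub-topics are all assumptions made for illustration. The sketch only shows the general idea of scoring a summary by the summed weights of the words it covers and greedily adding the sentence that increases that score the most.

```python
def summary_information(selected, weights):
    """Sum of the weights of the distinct words covered by the selected sentences."""
    covered = set()
    for sentence in selected:
        covered.update(sentence)
    return sum(weights.get(word, 0.0) for word in covered)


def select_sentences(ranked_subtopics, weights, max_sentences):
    """Greedy selection (hypothetical sketch).

    ranked_subtopics: list of clusters, most important first; each cluster is
    a list of sentences, and each sentence is a tuple of word tokens.
    weights: word -> weight (for example, centroid word weights).
    """
    chosen, covered = [], set()
    remaining = [list(cluster) for cluster in ranked_subtopics]

    def gain(sentence):
        # Weight of the words this sentence would newly add to the summary.
        return sum(weights.get(w, 0.0) for w in set(sentence) - covered)

    while len(chosen) < max_sentences and any(remaining):
        progressed = False
        for cluster in remaining:  # visit sub-topics in rank order
            if not cluster or len(chosen) >= max_sentences:
                continue
            best = max(cluster, key=gain)
            if gain(best) <= 0:
                cluster.clear()  # nothing new left in this sub-topic
                continue
            cluster.remove(best)
            chosen.append(best)
            covered.update(best)
            progressed = True
        if not progressed:
            break
    return chosen


# Toy data (invented for this example, not taken from the thesis).
weights = {"earthquake": 3.0, "rescue": 2.0, "aid": 2.0, "city": 1.0}
subtopics = [
    [("earthquake", "city"), ("earthquake", "hit", "city")],
    [("rescue", "teams", "city"), ("aid", "rescue")],
]
result = select_sentences(subtopics, weights, max_sentences=3)
print(result)
print(summary_information(result, weights))
```

In this toy run the second sentence of the first sub-topic is skipped because every weighted word in it is already covered, which is the redundancy-avoiding behaviour the abstract attributes to taking previously selected sentences into account.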
Keywords/Search Tags: multi-document summarization, sub-topic, cluster, sentence selection