Research On Multi-Document Summarization Method With Text Association

Posted on: 2024-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhong
Full Text: PDF
GTID: 2568306941963699
Subject: Computer technology
Abstract/Summary:
Text summarization is a core problem in natural language processing. With the rapid development of the Internet, the demand for quickly obtaining target information from massive data is increasing, and with it the range of applications for multi-document summarization. Multi-document summarization refers to analyzing, refining, and integrating documents related to multiple topics and generating a summary that covers the main content of all of them. However, existing work often simply concatenates the source documents into one long sequence and models multi-document summarization as a long sequence-to-sequence task, without considering document-level associations. At the same time, the combined input usually exceeds the length limit of the encoder; if the input is truncated, key information is easily lost and the context information in the document set is wasted. Other work learns cross-document relations by mining co-occurring words or entities, but because of the diversity of language expression, these co-occurring words or entities struggle to capture the implicit connections between documents. This thesis therefore focuses on how to use the relationships between documents effectively to improve the accuracy and comprehensiveness of multi-document summarization. The research covers the following three aspects.

First, to address the problem that traditional multi-document summarization work ignores document-level associations, this thesis proposes a multi-document summarization method based on an association discrimination model. The method first combines a siamese network with the pre-trained language model BERT to build a twin-tower model for association discrimination. The association discrimination model then produces a representation for each sentence in a pair, and the two sentence representations are concatenated. From the concatenated representation, the model judges the semantic relationship between any two sentences from three perspectives: whether they share the same topic, whether they come from the same source text, and whether they are adjacent (preceding and following) sentences. The parameters of the summarization model are updated through the training of the association discrimination model. Finally, the summarization model selects the sentences that best represent the main content of the document collection and organizes them into a summary. Experimental results show that, compared with traditional extractive multi-document summarization methods, this method achieves a substantial improvement in ROUGE scores.

Second, to exploit associations of different dimensions among multiple documents, this thesis proposes a multi-document summarization method based on multi-dimensional association construction. The method first divides the document set into semantic nodes of three dimensions (topic, source document, and sentence) and encodes the nodes of each dimension with the pre-trained model BERT. It then constructs a multi-dimensional multi-document association graph from the multi-level relationships between document-level and sentence-level nodes, and uses a graph convolutional network to capture the cross-document relationships in the document set from different aspects. Finally, the association graphs from the different dimensions are integrated to guide summary extraction. Experimental results show that this method makes full use of the multi-dimensional relationships between documents and that the model clearly outperforms the baseline methods.

Finally, to address the problem that context information is easily lost when long texts are truncated during encoding, this thesis proposes a multi-document summarization method that incorporates reference relations. The method first analyzes the referential relationships between sentences in each source document, uses a graph attention network to capture them, and extracts candidate content from each source document. It then adds segment embeddings and source embeddings to the embedding layer of the pre-trained BERT model, so that the encoding layer can learn the hierarchical relationships between documents, handle input consisting of multiple sentences from the document set, and produce a more accurate vector representation of each sentence. Finally, the extracted candidate content is concatenated and fed into the modified BERT model to further judge the importance of each sentence, and the highest-scoring sentences are selected to form the summary. Experimental results show that this method effectively filters important information from the original document set and improves performance on multi-document summarization tasks.
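To make the three association perspectives of the first method concrete (same topic, same source text, adjacent sentences), the following is a minimal sketch of how training labels for sentence pairs could be derived from a document set. It is an illustration under stated assumptions, not the thesis's implementation: the `Sentence` record, its field names, and the functions `association_labels` and `build_pairs` are hypothetical, document ids are assumed to be unique across topics, and the BERT-based twin-tower encoder that would consume these pairs is omitted.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Sentence:
    """Hypothetical record for one sentence in the document set."""
    text: str
    topic: str     # assumed topic (cluster) id
    doc: str       # assumed source-document id, unique across topics
    position: int  # index of the sentence inside its source document


def association_labels(a: Sentence, b: Sentence) -> dict:
    """Binary labels for one sentence pair, one per perspective."""
    same_topic = a.topic == b.topic
    same_source = a.doc == b.doc
    # "Preceding and following" is read here as: same document and
    # neighbouring positions. This reading is an assumption.
    adjacent = same_source and abs(a.position - b.position) == 1
    return {
        "same_topic": int(same_topic),
        "same_source": int(same_source),
        "adjacent": int(adjacent),
    }


def build_pairs(sentences: list) -> list:
    """Label every unordered pair of sentences in the set."""
    return [
        (a, b, association_labels(a, b))
        for a, b in combinations(sentences, 2)
    ]


# Toy document set: two topics, three documents, four sentences.
example = [
    Sentence("s0", topic="t1", doc="d1", position=0),
    Sentence("s1", topic="t1", doc="d1", position=1),
    Sentence("s2", topic="t1", doc="d2", position=0),
    Sentence("s3", topic="t2", doc="d3", position=0),
]

for a, b, labels in build_pairs(example):
    print(a.text, b.text, labels)
```

In a setup like the one the abstract describes, each labeled pair would be passed through the two towers, the two sentence representations concatenated, and a classifier trained against the three labels; that part depends on details the abstract does not give and is left out here.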
Keywords/Search Tags: Multi-document Summarization, Text Association, Pre-training Model, Graph Neural Network