
Extractive Summarization for Long Documents without Manual Annotation and in Low-Resource Scenarios

Posted on: 2024-09-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M M Tang
Full Text: PDF
GTID: 1528307145496294
Subject: Software Engineering
Abstract/Summary:
Text summarization is an essential task in Natural Language Processing (NLP): it aims to compress a document and extract its main ideas so that readers can quickly grasp the document's key information. Text summarization falls into two categories, extractive summarization and abstractive summarization. Compared with abstractive summarization, extractive summarization constructs summaries by selecting important text snippets from the document, so the resulting summaries are strictly faithful to the original text. This makes extractive summarization the preferable option in industrial scenarios that prioritize information accuracy. The academic community has studied extractive summarization extensively, focusing mainly on supervised and reinforcement-learning-based methods. However, existing extractive summarization methods still suffer from the following shortcomings:

i) Limited encoding length and lack of high-quality reference summaries. Because of the encoding-length limits of Pre-trained Language Models (PLMs), PLM-based extractive summarization methods are difficult to apply to long documents. Moreover, such methods require large-scale summarization datasets with human-authored reference summaries to achieve outstanding performance, whereas documents in real-world scenarios generally lack reference summaries. Existing PLM-based methods are therefore hard to apply to long-document summarization in practice.

ii) Error propagation and sound evaluation of summaries. Reinforcement Learning (RL)-based extractive summarization methods can hardly alleviate the "exposure bias" problem of autoregressive summarization. Although RL is typically used for Natural Language Generation (NLG) tasks with long decision sequences, when combined with autoregressive summarization it suffers from the "error propagation" caused by the "teacher forcing" training paradigm. In addition, the reward mechanisms used by RL-based summarization methods cannot fully exploit RL to dynamically evaluate extractive summaries and alleviate "exposure bias".

iii) Lack of supervision signals caused by small-scale training datasets. Existing extractive summarization methods are mainly applied to open-source datasets with large-scale "document-summary" pairs and need abundant supervision signals for outstanding performance. Most real-world scenarios, however, offer only extremely small summarization datasets, sometimes as few as 200 document-summary pairs, so existing methods struggle to achieve satisfactory results in low-resource settings.

This thesis seeks to address the aforementioned issues. Its research content is structured as follows:

· This thesis studies a BERT-based hierarchical encoder together with reward functions based on Natural Language Understanding (NLU) tasks, and performs long-document summarization without reference summaries through a reinforced extractive summarization method. To address the encoding-length limitation of PLM-based text encoders, the thesis proposes a BERT-based "sentence-paragraph-document" hierarchical encoder. Each paragraph of a document expresses its own semantic point of view while echoing the document's main idea; accordingly, the paragraph-level encoder uses BERT to encode the contextual semantics of each paragraph and obtain paragraph-aware sentence representations, and the document-level encoder further encodes the semantic correlations among sentences in different paragraphs to obtain global sentence representations (a minimal sketch follows this item). To apply extractive summarization to real-world scenarios with limited reference summaries, the thesis combines extractive summarization with RL and designs reward functions based on downstream NLU tasks of text summarization, guiding the model to extract semantically salient and reliable summaries. Experimental results demonstrate that the proposed method achieves better performance on multiple long-document summarization datasets.
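The abstract gives no implementation details, so the following is only a minimal sketch of what a "sentence-paragraph-document" hierarchical encoder might look like, assuming PyTorch and Hugging Face transformers; the class name, the use of per-sentence [CLS] tokens, the two document-level Transformer layers, and the linear scoring head are illustrative assumptions rather than the thesis's actual architecture.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class HierarchicalEncoder(nn.Module):
    """Sketch: BERT paragraph encoder + Transformer document encoder."""

    def __init__(self, bert_name="bert-base-uncased", n_doc_layers=2):
        super().__init__()
        # Paragraph level: BERT encodes each paragraph independently.
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        # Document level: relate sentences across paragraph boundaries.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           batch_first=True)
        self.doc_encoder = nn.TransformerEncoder(layer, num_layers=n_doc_layers)
        self.scorer = nn.Linear(hidden, 1)  # per-sentence extraction score

    def forward(self, para_input_ids, para_attention_mask, cls_positions):
        # para_input_ids: (num_paragraphs, seq_len), one row per paragraph,
        # with a [CLS] token inserted before every sentence.
        out = self.bert(input_ids=para_input_ids,
                        attention_mask=para_attention_mask).last_hidden_state
        # Gather the sentence-leading [CLS] vectors of every paragraph to get
        # paragraph-aware sentence representations.
        sent_vecs = [out[p, idx] for p, idx in enumerate(cls_positions)]
        sents = torch.cat(sent_vecs, dim=0).unsqueeze(0)  # (1, num_sents, h)
        # Global sentence representations via the document-level encoder.
        global_sents = self.doc_encoder(sents)
        return self.scorer(global_sents).squeeze(-1)      # (1, num_sents)
```

Because each paragraph is encoded separately, BERT's 512-token limit applies per paragraph rather than to the whole document, which is what makes such a scheme usable on long documents.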
· This thesis studies a contrastive learning paradigm and a sampling-based reward mechanism for RL-based autoregressive summarization to alleviate the "exposure bias" problem. The proposed contrastive learning training paradigm avoids the "error propagation" caused by "exposure bias": by constructing and ranking candidate summaries of a document, it encodes the ranking information so that the model learns to judge the semantic quality of summaries. Even if the summarization model deviates from the target summary at inference time, it can still recognize the best summary among the candidates constructed in subsequent decoding steps, thereby avoiding error propagation (a sketch of such a ranking loss follows this item). Additionally, the thesis designs a sampling-based reward mechanism to evaluate extractive summaries reasonably and obtain a better summarization policy. Experimental results demonstrate that the proposed method effectively improves autoregressive summarization methods on both short- and long-document summarization datasets.
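The exact training objective is not given in the abstract; the sketch below shows one common way to encode ranking information over candidate summaries, a pairwise margin loss in the spirit of contrastive candidate ranking, assuming PyTorch. The function name and the rank-scaled margin are illustrative assumptions.

```python
import torch

def candidate_ranking_loss(cand_scores: torch.Tensor,
                           margin: float = 0.01) -> torch.Tensor:
    """Pairwise margin loss over candidate summaries.

    cand_scores: (num_candidates,) model scores for candidates already
    sorted from best to worst by some quality metric (e.g. ROUGE). The
    loss pushes a better candidate to score higher than a worse one by
    a margin that grows with the rank gap.
    """
    loss = cand_scores.new_zeros(())
    n = cand_scores.size(0)
    for i in range(n - 1):
        for j in range(i + 1, n):
            loss = loss + torch.relu(cand_scores[j] - cand_scores[i]
                                     + margin * (j - i))
    return loss

# Four candidates, best first: the third outscoring the second is penalized.
scores = torch.tensor([0.8, 0.6, 0.7, 0.2], requires_grad=True)
print(candidate_ranking_loss(scores))
```

A model trained this way learns to compare whole candidate summaries, so at inference it is no longer tied to reproducing one gold sequence token by token.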
· Inspired by prompt learning and transfer learning, this thesis studies extractive summarization methods for low-resource scenarios. For scenarios with only small-scale summarization datasets, the thesis reformulates extractive summarization as a textual paraphrasing task between a document and its candidate summaries, which minimizes the gap between the summarization task and PLM training so that knowledge can be retrieved from PLMs for summarization. Additionally, the thesis transfers knowledge from textual paraphrasing tasks to summarization through transfer learning, guiding the model to recognize semantically salient summaries and thus reducing the amount of training data the method requires (a minimal sketch of the paraphrasing reformulation appears after the concluding paragraph below). Experimental results demonstrate that the proposed method outperforms state-of-the-art extractive summarization methods in all low-resource settings.

To summarize, this thesis proposes a BERT-based "sentence-paragraph-document" hierarchical encoder that uses PLMs to improve the text representations of summarization models. To tackle the problem that real-world documents lack human-authored summaries, it uses NLU tasks related to text summarization to provide supervision signals for model training. It also proposes a contrastive learning paradigm and a sampling-based reward mechanism for RL-based autoregressive summarization to alleviate the "exposure bias" problem, and an extractive summarization method designed specifically for low-resource scenarios. Through this research, the thesis expands the application scope of extractive summarization methods. In future work, extractive summarization will be combined with pre-trained abstractive summarization models to improve the readability of summaries while keeping them faithful to the main idea of the document.
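To make the paraphrasing reformulation concrete, here is a minimal zero-shot sketch that scores each candidate sentence by how strongly a paraphrase classifier relates it to the document, assuming Hugging Face transformers. The checkpoint name, the assumed label order (class 1 = paraphrase), and both helper functions are illustrative assumptions; the thesis's actual method additionally involves prompt templates and transfer learning that are not reproduced here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any model fine-tuned on a paraphrase task (e.g. MRPC) would do here.
MODEL = "textattack/bert-base-uncased-MRPC"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def paraphrase_score(document: str, candidate: str) -> float:
    """Probability that `candidate` paraphrases the (truncated) document."""
    enc = tok(document, candidate, truncation=True, max_length=512,
              return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # Assumed label order: index 1 is the "paraphrase" class.
    return torch.softmax(logits, dim=-1)[0, 1].item()

def extract_summary(document: str, sentences: list[str], k: int = 3) -> list[str]:
    """Pick the k sentences with the highest paraphrase affinity."""
    ranked = sorted(sentences, key=lambda s: paraphrase_score(document, s),
                    reverse=True)
    return ranked[:k]
```

Because the scoring model is pre-trained on paraphrase data rather than on document-summary pairs, this style of method needs far fewer annotated summaries, or none at all in the zero-shot form above.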
Keywords/Search Tags:extractive summarization, hierarchical encoder, reinforcement learning, contrastive learning, transfer learning