| With the rapid development of the Internet, the volume of online text is growing explosively, and acquiring valuable information quickly has become a major challenge. Automatic text summarization uses computers to compress and summarize the main content of a text, which makes it more efficient for people to obtain information. At present, most research on automatic text summarization targets short texts such as news and reviews, and this research has made some progress. However, as application scenarios continue to evolve, the need to summarize long texts is increasing. Automatic text summarization methods fall into two categories, extractive and abstractive, and both have significant limitations when summarizing long texts. To address this problem, this paper adopts a two-stage summarization method that divides long text summarization into two stages, key sentence extraction and summary generation, and designs a separate summarization model for each stage. The main work and contributions are as follows: (1) The pretrained model BART performs well on short text summarization tasks, but its model structure prevents it from handling long texts. Building on BART, we construct Long-BART (LBART) by introducing a sparse self-attention mechanism and extended positional encoding. LBART can handle longer texts and is suited to the long text summarization task. In addition, this paper proposes several strategies for reconstructing the training data, which avoid the low utilization of training data caused by the model's length limit and effectively improve training. (2) This paper proposes a hierarchical encoding-based key sentence extractor named Hierarchical Extractor (HiExt), which extracts key sentences from long texts and guides the
generator to produce higher-quality summaries. Existing extractive summarization models often ignore the hierarchical structure of long texts. Therefore, the extractor uses hierarchical encoding to fully exploit the rich hierarchical information in long texts. First, a hierarchical encoder produces encodings of the sentences, sections, and document; an attention mechanism then fuses these encodings to improve extraction. (3) This paper combines the extractor and the generator into a two-stage summarization model and, to demonstrate its effectiveness, compares it with ten other summarization models on two public datasets, PubMed and arXiv. Measured by ROUGE, on the PubMed dataset the ROUGE-1, ROUGE-2, and ROUGE-L scores of our model reach 48.016, 20.116, and 42.593, respectively, surpassing all comparison models; on the arXiv dataset they reach 47.982, 20.863, and 42.315, respectively, ranking first in ROUGE-1 and second in both ROUGE-2 and ROUGE-L among all compared models. |
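
The two-stage flow described in the abstract (extract key sentences, then generate a summary from them) can be illustrated with a minimal sketch. Everything named here is hypothetical and is not the paper's implementation: `score_sentences` is a simple word-frequency heuristic standing in for the hierarchical extractor, and `generate` is a caller-supplied function standing in for the LBART generator. Only the control flow is meant to mirror the description.

```python
import re
from collections import Counter
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentences(sentences: List[str]) -> List[float]:
    """Stand-in extractor (assumption): mean document-level frequency of a sentence's words."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    freq = Counter(tok for toks in tokenized for tok in toks)
    return [
        sum(freq[t] for t in toks) / len(toks) if toks else 0.0
        for toks in tokenized
    ]


def extract_key_sentences(text: str, k: int) -> List[str]:
    """Stage 1: keep the k highest-scoring sentences, in original document order."""
    sentences = split_sentences(text)
    scores = score_sentences(sentences)
    # sorted() is stable, so ties are resolved in favour of earlier sentences.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


def two_stage_summarize(text: str, k: int, generate: Callable[[str], str]) -> str:
    """Stage 2: pass the shortened input to a generator (hypothetical callable)."""
    key_sentences = extract_key_sentences(text, k)
    return generate(" ".join(key_sentences))


document = (
    "Long documents are hard to summarize. The weather was nice. "
    "Summarize long documents in two stages. Cats sleep."
)
key = extract_key_sentences(document, 2)
summary = two_stage_summarize(document, 2, lambda s: s)  # identity "generator"
```

In a realistic setting, `generate` would wrap a sequence-to-sequence model; the point of the first stage is that the generator only sees an input short enough for its length limit.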