
Research On Automatic Text Summarization In Chinese

Posted on: 2024-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: G X Hua
Full Text: PDF
GTID: 2568307178973949
Subject: Software engineering
Abstract/Summary:
The rapid development of new social media applications has produced massive amounts of data, including a large volume of text. Long texts contain considerable redundant information and are inefficient to use, so how to extract concise, refined summaries from large bodies of text is the core concern of automatic text summarization. Among existing methods, extractive summarization builds summaries by scoring and selecting sentences; because it composes the summary directly from candidate sentences of the source text, it suffers from inaccurate sentence selection and disfluent output. Abstractive summarization instead generates summaries by understanding and paraphrasing the original text, simulating the process of writing a summary by hand. Abstractive summaries are superior to extractive ones in readability and fluency, but the semantic information of the source text is still not fully utilized, and the generated summary can be inconsistent with the facts described in the reference summary. This thesis conducts in-depth research on Chinese extractive and abstractive summarization, proposing an extractive model and an abstractive model that each incorporate more information from the source text. It further proposes a hybrid summarization model with two-stage training to improve the model's ability to encode long texts. The specific work is as follows:

(1) An extractive summarization method based on a two-tower structure. This method improves the accuracy of sentence scoring, and thus the quality of the resulting summaries, by incorporating more semantic information from the source text. The proposed two-tower extractive model uses a sentence BERT to obtain sentence representation vectors and a document BERT to obtain document representation vectors, interacts the two at an information-interaction layer so that sentence-level context and full-document semantics are integrated into the model, and adds external features such as word frequency and relative position to the sentence representations to improve scoring accuracy. Comparative experiments show that the summaries produced by the proposed extractive model are of higher quality than those of all compared models, and ablation experiments confirm the effectiveness of the two-tower structure and the information-interaction method.

(2) An encoder-augmented abstractive summarization model. The model is built on a pointer network. On the encoder side, multiple convolution kernels extract semantic information from the source text at different granularities, and a neural topic model extracts the topic information of the article. An information-fusion layer then combines the source-text semantics encoded by a long short-term memory (LSTM) network, the semantic features extracted by the convolutional neural network, and the topic information from the neural topic model, strengthening the encoder and thereby assisting summary generation. ROUGE scores show that the proposed encoder enhancement does improve the quality of the generated summaries, and case analysis confirms that the local-semantics and neural-topic modules enhance the model's ability to capture the original topic and preserve more complete semantic information.

(3) A hybrid summarization method based on a hierarchical encoder and two-stage training. This method extends the model's capacity to encode long texts by splitting the source text into sentences and using a pre-trained BERT to obtain sentence representation vectors. The model is trained in two stages: first on the extractive summarization task, and then on the abstractive summarization task. Experimental results show that the proposed hybrid model outperforms all comparison models in ROUGE score, demonstrating the effectiveness of the hybrid architecture. A comparison of training regimes further shows that, relative to single-stage training, two-stage training encodes more semantic information and yields a larger improvement in summary quality.
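To make the two-tower scoring idea in (1) concrete, the following is an illustrative sketch only, not the thesis's actual implementation: sentence vectors interact with the document vector (simplified here to cosine similarity), and hand-crafted features such as word frequency and relative position are added to the score. All function names, weights, and feature scalings are hypothetical.

```python
import numpy as np

def score_sentences(sent_vecs, doc_vec, word_freq, rel_pos,
                    w_interact=1.0, w_freq=0.5, w_pos=0.5):
    """Score sentences by interacting each sentence vector with the
    document vector, then adding external features.

    sent_vecs: (n, d) sentence representations (e.g. from a sentence BERT)
    doc_vec:   (d,)   document representation (e.g. from a document BERT)
    word_freq: (n,)   normalised word-frequency feature per sentence
    rel_pos:   (n,)   relative-position feature per sentence
    """
    # Information-interaction layer, reduced here to cosine similarity.
    sims = sent_vecs @ doc_vec / (
        np.linalg.norm(sent_vecs, axis=1) * np.linalg.norm(doc_vec) + 1e-8
    )
    return w_interact * sims + w_freq * word_freq + w_pos * rel_pos

def select_summary(scores, k=2):
    """Return the indices of the top-k sentences, in document order."""
    top = np.argsort(scores)[::-1][:k]
    return sorted(top.tolist())
```

In the actual model the interaction layer and feature weights are learned; this sketch only shows how scoring and selection compose an extractive summary.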
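The information-fusion layer in (2) can likewise be sketched in miniature: max-over-time features from several convolution widths and a topic vector are concatenated with the LSTM state and projected. This is a minimal illustration under assumed shapes, with random kernels standing in for learned ones; none of these names appear in the thesis.

```python
import numpy as np

def conv_max_feature(token_embs, kernel):
    """Max-over-time output of one convolution kernel over token embeddings."""
    w = kernel.shape[0]
    outs = [float(np.sum(token_embs[i:i + w] * kernel))
            for i in range(token_embs.shape[0] - w + 1)]
    return max(outs)

def fuse_encoder_states(lstm_state, token_embs, topic_vec, kernels, W):
    """Fuse LSTM semantics, multi-kernel convolution features, and topic
    information into one enhanced encoder state (linear layer + tanh here)."""
    conv_feats = np.array([conv_max_feature(token_embs, k) for k in kernels])
    z = np.concatenate([lstm_state, conv_feats, topic_vec])
    return np.tanh(W @ z)
```

The point of the sketch is the concatenate-then-project pattern: the pointer-network decoder then attends over states that already carry local (CNN) and global (topic) information.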
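Finally, the two-stage regime in (3) is just an ordering of training phases over a shared model; a schematic skeleton (the `train_step` callback and task labels are invented for illustration) looks like this:

```python
def two_stage_train(model, extractive_batches, abstractive_batches, train_step):
    """Stage 1 warms up the shared hierarchical encoder on the extractive
    task; stage 2 fine-tunes the whole model on abstractive generation."""
    for batch in extractive_batches:
        train_step(model, batch, task="extractive")
    for batch in abstractive_batches:
        train_step(model, batch, task="abstractive")
    return model
```

The design choice being illustrated: the extractive stage teaches the encoder which sentences matter before the harder generation objective is introduced, which is the mechanism the thesis credits for encoding more semantic information than single-stage training.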
Keywords/Search Tags:Automatic text summarization, pre-trained BERT, neural topic model, two-stage training