| With the rapid growth in internet users and the explosion of text data, the sheer volume of data has actually reduced people’s efficiency in obtaining useful information. Reading an abstract gives a general understanding of a long text, which makes automatic text summarization an important research direction in natural language processing. In recent years, generative (abstractive) summarization has become the mainstream approach, but long-text summarization still faces the following problems: appropriate datasets are lacking, generative summarization models have difficulty capturing the features of unknown (out-of-vocabulary) words, and the generated content tends to repeat itself. In response, this thesis does the following work: it constructs a technical paper dataset, trains an automatic summarization model for technical papers, and designs and implements an automatic summarization system for technical papers. First, this thesis collected thousands of computer science papers from relevant academic websites as the raw material for the dataset. Three methods (a weighting strategy, a topic model, and a summarization algorithm) were used to filter the content of each paper and select important sentences to form a new, condensed body of text. This condensed text serves as the source text to be summarized, while the paper’s original abstract serves as the reference summary. Then, this thesis designs a generative text summarization model based on the ALBERT-UNILM structure. ALBERT is selected as the semantic understanding model, and its masked language model is fine-tuned with whole-word masking, with the input text processed according to Chinese syntax. UNILM is selected as the summary generation model; a pointer network is used to capture the features of unknown words, and a coverage mechanism is used to alleviate repeated generation. These improvements effectively improve the model’s performance on summary generation tasks. In the controlled experiments designed in this thesis, compared with the currently well-performing BERT-UNILM model, the proposed model improves ROUGE-1, ROUGE-2, and ROUGE-L by 0.61, 0.22, and 0.47 on the NLPCC dataset and by 1.08, 0.97, and 0.94 on the LCSTS dataset; on the self-built dataset, it improves ROUGE-1 and ROUGE-L by 0.37 and 0.33. Finally, this thesis applies software engineering methods to the requirements analysis and system design of the text summarization system, and implements a prototype system for automatic summarization of Chinese technical papers using front-end and back-end development technologies. The system’s usability and reliability were verified through system testing. In summary, the main contributions of this thesis are the construction of a technical paper summarization dataset, the training of an automatic summarization model for technical papers, and the design and implementation of an automatic summarization system for technical papers. The system can help users generate summaries of computer-related technical articles for their reference and use. |
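
The pointer network and coverage mechanism mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common pointer-generator formulation: a generation probability `p_gen` mixes the vocabulary distribution with the attention weights over source tokens, and a coverage penalty is the sum of element-wise minima between the current attention and the accumulated attention. The abstract does not state the exact equations used in the thesis, so the function names, parameters, and toy values here are hypothetical and for illustration only.

```python
from collections import defaultdict


def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generation and copy probabilities (assumed pointer-generator form).

    p_gen         -- probability of generating from the vocabulary (0..1)
    vocab_dist    -- dict mapping in-vocabulary words to probabilities
    attention     -- attention weights over source positions (sums to 1)
    source_tokens -- source tokens, aligned with `attention`
    """
    mixed = defaultdict(float)
    # Generation path: scale the vocabulary distribution by p_gen.
    for word, prob in vocab_dist.items():
        mixed[word] += p_gen * prob
    # Copy path: each source token receives (1 - p_gen) * its attention,
    # so a word missing from the vocabulary can still get probability mass.
    for token, weight in zip(source_tokens, attention):
        mixed[token] += (1.0 - p_gen) * weight
    return dict(mixed)


def coverage_step(coverage, attention):
    """Return (penalty, updated_coverage) for one decoding step.

    The penalty is large when the decoder attends again to positions
    that have already received attention, which discourages repetition.
    """
    penalty = sum(min(a, c) for a, c in zip(attention, coverage))
    updated = [c + a for c, a in zip(coverage, attention)]
    return penalty, updated


# Toy example: "ALBERT" is not in the vocabulary but appears in the source.
source_tokens = ["the", "ALBERT", "model"]
vocab_dist = {"the": 0.5, "model": 0.5}
attention = [0.1, 0.8, 0.1]

dist = final_distribution(0.6, vocab_dist, attention, source_tokens)

coverage = [0.0, 0.0, 0.0]
first_penalty, coverage = coverage_step(coverage, attention)
second_penalty, coverage = coverage_step(coverage, attention)
```

In this toy run the out-of-vocabulary token "ALBERT" receives probability only through the copy path, and attending to the same positions a second time produces a positive coverage penalty where the first step produced none.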