| In recent years, patents, a core competitive asset of technology companies, have been filed at a rate of more than 2 million applications per year in China alone. To make full use of this large body of information, deep processing of patent text has attracted wide attention. However, the core subject matter and key innovations of a patent are difficult to extract, because patent documents are highly specialized, long, and obscurely worded. Chinese patent specifications in the computer communications field require the abstract to be short and concise while still describing the technical content of the patent clearly, which is a pain point for patent applicants. At the same time, patent texts are very long: some domestic studies on Chinese long-text summarization work at a scale of up to 500 words, whereas patent texts of 1,500 to 3,000 words must be processed, far beyond the range those long-text methods handle. To solve these problems, this paper first explores and confirms the potential of BertSum for patent text summarization; second, it processes long patent texts with two different methods, optimizes the input representation for Chinese, and fine-tunes the latest large-scale Chinese pre-trained model on a long-text patent dataset; finally, it generates summaries with two different classifiers, offering two new approaches to improving the efficiency and quality of patent summary extraction. The specific research work is as follows: (1) When traditional text summarization algorithms are applied to long Chinese patent texts, they lose key information, produce summaries that deviate from the core subject matter, and introduce excessive redundancy. To address these problems, the PatBertSum algorithm is proposed, which processes long patent texts efficiently and generates high-quality long-text summaries. The method is based on an improved BertSum model and uses the new CLTPDS patent text dataset: it handles long texts with a Head-Tail strategy, adapts the input representation for Chinese, generates sentence vectors with a pre-trained model, and captures internal text features and text structure features to extract the summary. Experiments show an 8% improvement in ROUGE recall and F-value over existing methods. (2) The PatBertSum model processes long patent texts and generates summaries too coarsely, which degrades the final summary quality. To address this problem, the PoolBertSum algorithm is proposed. By building an extractive summarization model for long Chinese patent texts, the algorithm processes long patent texts with high quality and generates better long-text summaries. The method is based on an improved BertSum model: it handles long texts by pooling, adapts the input representation for Chinese, generates sentence vectors with a pre-trained model, and finally uses a Transformer as a decoder for feature classification to extract the summary. Experiments show that the method improves ROUGE recall and F-value by 15% compared with existing methods. |
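
The two long-text strategies named in the abstract can be illustrated with a minimal sketch. It is only one possible reading and not the paper's implementation: the function names `head_tail_truncate` and `mean_pool_chunks`, the 512-token budget, the 128-token head, and the choice of mean pooling over fixed-size chunks are all assumptions made for illustration.

```python
from typing import List, Sequence


def head_tail_truncate(tokens: Sequence[str],
                       max_len: int = 512,   # assumed budget (BERT-style input limit)
                       head_len: int = 128   # assumed head size, not from the paper
                       ) -> List[str]:
    """Keep the first head_len tokens and the last (max_len - head_len) tokens."""
    if not 0 <= head_len <= max_len:
        raise ValueError("head_len must be between 0 and max_len")
    if len(tokens) <= max_len:
        return list(tokens)  # short enough: nothing to cut
    tail_len = max_len - head_len
    # Guard against tokens[-0:], which would return the whole sequence.
    tail = list(tokens[-tail_len:]) if tail_len > 0 else []
    return list(tokens[:head_len]) + tail


def mean_pool_chunks(vectors: Sequence[Sequence[float]],
                     chunk_size: int) -> List[List[float]]:
    """Average each run of chunk_size consecutive vectors into one vector.

    Hypothetical stand-in for "processing long texts by pooling": it shortens
    a long sequence of vectors instead of discarding the middle of the text.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pooled = []
    for start in range(0, len(vectors), chunk_size):
        chunk = vectors[start:start + chunk_size]
        dim = len(chunk[0])
        pooled.append([sum(v[i] for v in chunk) / len(chunk) for i in range(dim)])
    return pooled


if __name__ == "__main__":
    tokens = [f"t{i}" for i in range(2000)]   # stand-in for a tokenized patent
    kept = head_tail_truncate(tokens)
    print(len(kept), kept[0], kept[-1])
    print(mean_pool_chunks([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], chunk_size=2))
```

Under these assumptions, Head-Tail truncation is cheap but drops the middle of the document, while pooling keeps a compressed trace of every part of the text. That difference is consistent with the abstract's description of PatBertSum as the coarser of the two methods.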