Extractive Abstracts Of Long Chinese Patent Texts Based On Improved Bertsum Model

Posted on:2024-01-28

Degree:Master

Type:Thesis

Country:China

Candidate:P Qin

GTID:2568307094959219

Subject:Electronic information

Abstract/Summary:

In the new era, patents are a core competitive asset of technology companies, and more than 2 million patent applications are filed per year in China alone. To fully exploit this huge information resource, deep processing technology for patents has received extensive attention. However, it is very difficult to extract the core subject matter and key innovations of patent documents because of their specialized nature, large text size and obscure content. Chinese patent specifications in the field of computer communications require the patent abstract to be short and concise while still describing the technical content of the patent clearly, which is a pain point for patent applicants. At the same time, patent texts are very long: some domestic studies on Chinese long-text summarization have worked at a scale of up to 500 words, whereas patent texts need to be processed in the range of 1,500 to 3,000 words, far beyond the range those long-text methods handle. To solve these problems, this paper first explores and confirms the potential of BertSum for patent text summarization. Second, it processes long patent texts with two different methods, optimizes the input representation for Chinese, and fine-tunes the latest large-scale Chinese pre-trained model on a long patent text dataset. Finally, it generates abstracts with two different classifiers, providing two new approaches to improving the efficiency and quality of patent abstract extraction. The specific research work is as follows:

(1) Traditional extractive summarization algorithms, when applied to long Chinese patent texts, lose key information, produce abstracts that deviate from the core subject matter of the text, and generate excessive redundancy. To address these problems, the PatBertSum algorithm is proposed, which processes long patent texts efficiently and generates high-quality long-text abstracts. The method is based on an improved BertSum model and uses the new CLTPDS patent text dataset. It processes long texts with the Head-Tail method, transforms the Chinese input representation, generates sentence vectors with a pre-trained model, and captures internal text features and text structure features to extract abstracts. Experiments show an 8% improvement in ROUGE recall and F-value compared with existing methods.

(2) The PatBertSum model is too coarse in how it processes long patent texts and generates abstracts, which affects the quality of the final abstract. To address this problem, the PoolBertSum algorithm is proposed. By building an extractive summarization model for long Chinese patent texts, the algorithm processes long patent texts with high quality and generates higher-level long-text abstracts. The method is based on an improved BertSum model. It processes long texts by pooling, transforms the Chinese input representation, generates sentence vectors with a pre-trained model, and finally uses a Transformer as a decoder for feature classification to extract summaries. Experiments show that the method improves ROUGE recall and F-value by 15% compared with existing methods.
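The Head-Tail processing named in (1) is not detailed in this abstract. The following is a minimal sketch of one common form of head-tail truncation, in which the start and end of an over-long token sequence are kept and the middle is dropped. The function name `head_tail_truncate`, the 510-token budget and the 128-token head are illustrative assumptions, not values taken from the thesis.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def head_tail_truncate(tokens: Sequence[T], max_len: int = 510,
                       head_len: int = 128) -> List[T]:
    """Keep the first `head_len` and the last `max_len - head_len` tokens.

    Hypothetical helper: the budget (510) and head size (128) are
    assumptions chosen for illustration, not the thesis's settings.
    """
    if max_len <= 0 or not 0 <= head_len <= max_len:
        raise ValueError("require max_len > 0 and 0 <= head_len <= max_len")
    # Short inputs fit the budget already and are returned unchanged.
    if len(tokens) <= max_len:
        return list(tokens)
    tail_len = max_len - head_len
    head = list(tokens[:head_len])
    # Guard the tail slice: tokens[-0:] would return the whole sequence.
    tail = list(tokens[len(tokens) - tail_len:]) if tail_len > 0 else []
    return head + tail


# Usage: a 2,000-token patent text reduced to a 510-token input.
# For Chinese, list(text) gives a rough character-level token list.
truncated = head_tail_truncate(list(range(2000)))
```

Keeping both ends reflects the assumption that the opening and closing parts of a document carry most of its key statements; whether that holds for patent texts is the kind of question the thesis's experiments address.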

Keywords/Search Tags:

natural language processing, patents automatic summarization, Bertsum algorithm, ROUGE index

Related items

1	Automatic Summarization Of Multimedia Information And Related Technology Research
2	Automatic Summarization System Based On Natural Language Processing
3	Research On News Text Automatic Summarization Based On BERT Model
4	Research On Automatic Summarization Of Microblog Events
5	Research On Automatic Text Summarization Based On Entity Information Embedding
6	Research On Automatic Text Summarization Algorithm For Chinese And English Long Text
7	Submodularity in Natural Language Processing: Algorithms and Applications
8	Research On The Algorithms For Automatic Summarization Of Single Text Documents In Uyghur
9	Research On Key Techniques Of Query-focused Multi-document Summarization
10	Research On Automatic Summarization Algorithm For Meeting Speech Transcribed Text