Research On Key Technologies Of Word Segmentation In Chinese Patent Documents

Posted on: 2023-07-26
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
GTID: 2558306848458244
Subject: Software engineering
Abstract/Summary:
Patent documents record the latest inventions in many fields and are of great significance to scientific and technological innovation and development. Word segmentation is a basic and critical task in natural language processing research on Chinese patent documents, and it plays an important role in downstream patent information processing tasks such as patent machine translation and patent retrieval. With the development of deep learning, a series of neural network word segmentation models have been proposed, and the performance of Chinese word segmentation has improved substantially. However, word segmentation research for Chinese patent documents still relies on traditional methods, and the technology urgently needs improvement. Today’s mainstream neural network word segmentation models generally depend on large-scale annotated corpora, yet there is a lack of large-scale public annotated corpora in the patent field, which is an important challenge for Chinese patent document word segmentation research. In this context, this paper conducts an in-depth study of the key technologies of word segmentation for Chinese patent documents. On the one hand, it studies how to recognize nested patent terms; on the other hand, it studies how to use the linguistic information in unlabeled corpora to improve a Chinese patent document segmentation model when the annotated dataset is limited. The main contributions of this paper are as follows:

(1) To address the absence of a public annotated corpus in the patent field, a patent-domain word segmentation dataset is constructed by segmenting text automatically with existing word segmentation systems and then manually cross-checking the results against the segmentation rules.

(2) To address the large number of nested terms in patent documents, a graph-based word segmentation method for Chinese patent documents is proposed. Two-way interaction between the graph module and the outer module makes full use of the terminology knowledge in both modules, improving both nested term recognition and word segmentation performance. In addition, the BERT masked language model is further pre-trained on large-scale unlabeled patent document data to obtain better initialization parameters, which further improves the performance of the model.

(3) A semi-supervised word segmentation method for Chinese patent documents based on collaborative training is proposed, following the idea of the Tri-training algorithm. During training, three initial word segmentation models complement one another's strengths, and trusted samples are selected and added to the training set, which gradually improves the performance of the model and alleviates the lack of labeled data.
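The sample-selection step of contribution (3) can be illustrated with a minimal Python sketch. It assumes the classic Tri-training criterion, in which a sentence is trusted for one model when the other two models produce the same segmentation; the abstract does not state the exact criterion used in the thesis. The forward-maximum-matching segmenters, their vocabularies, and the names `make_fmm_segmenter` and `select_trusted` are hypothetical stand-ins for the three trained models.

```python
from typing import Callable, List, Set, Tuple

Segmenter = Callable[[str], List[str]]
Sample = Tuple[str, List[str]]


def make_fmm_segmenter(vocab: Set[str], max_len: int = 4) -> Segmenter:
    """Toy forward-maximum-matching segmenter (a stand-in for a trained model)."""
    def segment(text: str) -> List[str]:
        words: List[str] = []
        i = 0
        while i < len(text):
            # Try the longest candidate first; a single character always matches.
            for n in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + n]
                if n == 1 or piece in vocab:
                    words.append(piece)
                    i += n
                    break
        return words
    return segment


def select_trusted(models: List[Segmenter], unlabeled: List[str]) -> List[List[Sample]]:
    """For each of the three models, collect the unlabeled sentences on which
    the OTHER two models agree (assumed Tri-training agreement criterion)."""
    assert len(models) == 3
    new_data: List[List[Sample]] = [[] for _ in models]
    for sent in unlabeled:
        preds = [m(sent) for m in models]
        for i in range(3):
            j, k = [x for x in range(3) if x != i]
            if preds[j] == preds[k]:
                new_data[i].append((sent, preds[j]))
    return new_data


# Hypothetical vocabularies: the third model lacks the full term.
models = [
    make_fmm_segmenter({"半导体", "器件"}),
    make_fmm_segmenter({"半导体", "器件"}),
    make_fmm_segmenter({"半导", "器件"}),
]
pseudo_labeled = select_trusted(models, ["半导体器件"])
```

In a full training loop, each model would be retrained on its labeled data plus its own list from `pseudo_labeled`, and the selection would be repeated until no new samples are added.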
Keywords/Search Tags:Patent documents, Chinese word segmentation, Collaborative training, Natural language processing