| Dunhuang literature is a shining pearl in the long river of Chinese civilization, and the country attaches great importance to its collection and organization. Among the multilingual ancient manuscripts unearthed in the Dunhuang Library Cave, Tibetan manuscripts are second in number only to Chinese ones. Collecting, organizing, and digitizing Dunhuang Tibetan literature has important academic value and practical significance for studying Tibetan history. Although Tibetan text recognition has been studied for a long time, text recognition for Dunhuang ancient books is still in its infancy. Compared with ordinary handwritten documents, Dunhuang ancient literature poses two difficulties. First, the layout is complex, erosion has damaged the edges of the fragments, and some text is hard to make out even with the naked eye. Second, the handwriting is blurry and the writing styles are diverse. Training data for Dunhuang Tibetan ancient books is also scarce, and existing results on Tibetan recognition cannot be applied directly to Dunhuang Tibetan literature. Text recognition technology has evolved from single-character recognition to end-to-end multi-line and full-page recognition, in which character positions are learned implicitly from transcription labels alone. To address data scarcity and multi-style character recognition, this paper uses transfer learning to improve a full-page model based on the Encoder-Decoder framework. The specific work is as follows: 1. Analyze the characteristics of the Dunhuang Tibetan literature held by the French National Library, propose a set of annotation methods for Dunhuang ancient texts, and construct a Dunhuang Tibetan literature dataset. To address data scarcity, traditional data augmentation and printed-data augmentation are each used to expand the data; to improve data quality, the ancient books are binarized using multi-point annotation 
information. 2. Study the recognition of Dunhuang Tibetan ancient books using transfer learning. The model is first pre-trained at the line level, and the full-page recognition model is then initialized with the pre-trained encoder before training. To address difficult convergence and the small number of samples, this paper compares the test-set performance of different feature extraction networks within the Encoder-Decoder framework under transfer learning. 3. Study full-page recognition of Dunhuang Tibetan ancient books. The OrigamiNet, IFA, VAN, and SPAN algorithms are applied to full-page recognition of ancient books, and the recognition performance of the different models is analyzed on different printed datasets. On the Dunhuang Tibetan literature dataset, targeted improvements are made to the line compression module and the decoder of the model. Ablation experiments covering Tibetan preprocessing, transfer learning, the improved full-page recognition module, and added synthetic data verify the effectiveness of the improved network for character recognition, and the common errors in recognizing Dunhuang Tibetan ancient books are summarized. This paper achieves the following results: 1. A Dunhuang Tibetan literature dataset is constructed from images in the first volume of the "Dunhuang Tibetan Literature Collection of the French National Library" using multi-point annotation; it contains 3977 line images and 1000 full-page images, and a further 5468 full-page printed images of varying quality are synthesized. 2. Recognition of Dunhuang Tibetan literature based on transfer learning is realized. In experiments with VGG-series and ResNet-series feature extractors, the error rates fell from 57.77% and 72.77% to 17.29% and 51.22%. This verifies that transfer learning yields a lower recognition error rate and faster convergence under the Encoder-Decoder 
framework. 3. The OrigamiNet, VAN, SPAN, and IFA algorithms achieve average error rates of 0.105%, 17.763%, 9.29%, and 4.22%, respectively, on the printed datasets. The improved model reduces the character error rate on the Dunhuang Tibetan literature dataset from 5.57% to 4.84%. Based on these experiments, this paper groups common errors into four categories: handwriting style, similar fonts, layout issues, and other issues. |
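
The abstract reports character error rates but does not state how they are computed. A common convention, assumed here and not confirmed by the source, is the total Levenshtein edit distance between the recognized and reference transcriptions divided by the total number of reference characters. The sketch below illustrates that convention; the function names `edit_distance` and `character_error_rate` are hypothetical, and each Unicode code point is treated as one character, which may differ from the unit the authors count for Tibetan.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    # prev[j] is the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses):
    """Total edit distance divided by total reference length (assumed CER definition)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

Summing edits and lengths over the whole test set, instead of averaging per-page ratios, keeps short lines from dominating the score; whether the reported figures were aggregated this way is not stated in the abstract.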