Research On The Establishment And Application Of The Sample Database Of Tangut Script

Posted on: 2019-11-11
Degree: Master
Type: Thesis
Country: China
Candidate: W H Yang
GTID: 2405330551454403
Subject: Engineering
Abstract/Summary:
Digitizing ancient books helps protect and share them, and it is the main channel through which ancient books are studied in modern society. Tangut script is the ancient script of the Dangxiang people. Through the Tangut script preserved in ancient books, we can understand the society, history, and national culture of the Western Xia Dynasty. Excavating and preserving ancient Tangut literature is therefore an important way to study Tangut script. However, because of their age, very few books from the Tangut period survive, and many of them suffer from paper damage and unclear writing, which hinders the digitization of Tangut script. Today, optical character recognition, machine learning, and other techniques can greatly help people interpret ancient scripts, but these technologies depend on character databases, which provide the training samples and evaluation standards for character recognition. Establishing a standard, open, and universal Tangut script sample database is therefore the premise and foundation for research on Tangut character recognition. Such a database not only provides test samples and evaluation standards for intelligent recognition algorithms for Tangut script, but also compensates for the scarcity of specialists who have mastered the Tangut language system. It gives Tangutology researchers more convenient research tools and more efficient research methods, and it strongly supports digital information retrieval for ancient books. At present, no sample database for Tangut script recognition has yet been established.

This paper focuses on the techniques for establishing and applying a sample database of Tangut script. First, Buddhist sutras in Tangut script are selected as the data source. The scanned ancient texts are then preprocessed and the text is extracted. The extracted text image information is organized into the Tangut script sample database, which comprises a text sample database and a single-character sample database. The text database is organized as Excel tabular files. By reading the information in the Excel tables, users can easily query Tangut characters, which improves on the traditional form of annotation. The single-character database is organized in order of character frequency, and the single-character image files are named strictly according to a fixed convention, so that Tangutology researchers searching ancient books and documents through the database can easily find in which documents a Tangut character has appeared and how it has been translated.

Finally, research on intelligent recognition of Tangut script was conducted on the basis of the sample database. A deep learning model was built using convolutional neural networks and trained on the Xixia (Tangut) dataset. To address the problem of unbalanced samples, MLSD is used to expand the samples and improve the performance of the learning and recognition algorithm for Tangut script. In short, we established a sample database of Tangut script with both theoretical research value and practical application value, which greatly benefits the digitization of Tangut script.
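As a rough illustration of how a single-character database ordered by character frequency could be indexed, the following Python sketch counts character occurrences, ranks them, and records the documents in which each character appears. It is not the thesis's implementation: the character identifiers, the document names, the function names, and the file-naming pattern are all hypothetical, since the abstract does not state the actual naming convention.

```python
from collections import Counter, defaultdict


def build_character_index(occurrences):
    """Rank characters by frequency and record where each one appears.

    `occurrences` is an iterable of (char_id, document) pairs, one per
    extracted single-character image. Both fields are hypothetical labels.
    """
    occurrences = list(occurrences)  # allow generators; we iterate twice
    counts = Counter(char_id for char_id, _ in occurrences)

    documents = defaultdict(set)
    for char_id, document in occurrences:
        documents[char_id].add(document)

    # Most frequent first; ties are broken by char_id so the order is stable.
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return [
        {
            "rank": rank,
            "char_id": char_id,
            "count": counts[char_id],
            "documents": sorted(documents[char_id]),
        }
        for rank, char_id in enumerate(ordered, start=1)
    ]


def image_file_name(rank, char_id, sample_no):
    """Build a file name from a hypothetical pattern: rank_charid_sample.png."""
    return f"{rank:04d}_{char_id}_{sample_no:03d}.png"
```

With an index of this kind, looking up a character's entry returns the list of documents in which it occurs, which is the sort of query the abstract describes for Tangutology researchers.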
Keywords/Search Tags: Tangut script, data source, character extraction, sample database, deep learning