
Segmentation And Recognition Of Touching Character String For Tibetan Historical Documents

Posted on: 2020-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: Q C Zhao
GTID: 2415330623456679
Subject: Computer technology
Abstract/Summary:
In the long history of human development, people of all ethnic groups have left precious historical footprints. As an important information carrier, historical documents are of great value for studying people's early production and way of life. Tibetans are one of the ethnic minorities in China with a long history and culture and with their own language and script. The Tibetan people have created a unique plateau culture and left a rich cultural heritage in many fields. Tibetan historical literature is an important source for studying Tibetan history, culture, and Tibetan Buddhism, and has recently received extensive attention from scholars. However, because these documents are so old, every consultation of the originals may cause devastating damage to them. Digitizing Tibetan historical documents not only protects the paper originals but also makes the documents easier to use. Early historical documents were mostly printed from woodblocks. Under the influence of ink diffusion and humidity, a large number of touching character strings appeared in these documents. Research on touching characters in English, Chinese, Japanese, and Arabic numerals has been fruitful, but there has been no such research on Tibetan documents. To explore the segmentation and recognition of touching character strings in Tibetan historical documents, the main research work of this paper is as follows.

Firstly, this paper reviews work at home and abroad on the segmentation and recognition of touching characters in many languages, as well as the research status and frontier trends for Tibetan historical documents. It analyzes the work of other scholars on this subject and summarizes the methods and techniques commonly used for this problem, which provides a reference for the follow-up work of this paper.

Secondly, because research on Tibetan historical documents is scarce, there is no publicly available database in this field. This paper screens 7,500 touching character strings from Tibetan historical document images by connected component analysis and uses XML files to mark the coordinates of the touching points and the category of each touching character string, building the first database of touching character strings in Tibetan historical documents. This paper also improves the classic drip (drop-fall) algorithm by using the shortest path, which makes the segmentation path of a touching character string more reasonable. Experiments show that the recall rate of the improved drip algorithm reaches 73.02% on simple touching character strings.

Thirdly, after analyzing the structure of Tibetan characters, this paper proposes an over-segmentation algorithm based on contour feature point detection. The algorithm first divides the Tibetan characters into an upper vowel area and a consonant area along the Tibetan baseline. For the upper vowel area, it then uses an SVM upper-vowel classifier to filter the feature points; for the consonant area, it filters feature points with rules. Finally, it uses the remaining feature points to construct a segmentation path. The recall rate of this method reaches 81.42% on touching character strings with complex touching patterns.

Finally, a deep learning framework is used to develop a recognition system for complex Tibetan historical document text and Latin-transliterated Tibetan text, which implements image text recognition.
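The abstract does not give the cost function or the drop-fall rules used in the improved drip algorithm, so the following is only a minimal sketch of the general shortest-path idea, under stated assumptions: the input is a small binary crop around a candidate touching point, cutting through an ink pixel is penalized more than passing through background, and the path moves one row down and at most one column sideways per step. The function name `shortest_cut_path` and the constants `INK_COST` and `BG_COST` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

INK_COST = 10  # assumed penalty for cutting through a stroke pixel
BG_COST = 1    # assumed cost for passing through a background pixel


def shortest_cut_path(img: List[List[int]]) -> Tuple[int, List[int]]:
    """Return (total cost, column per row) of the cheapest top-to-bottom path.

    img is a binary image crop: 1 = ink, 0 = background.
    """
    rows, cols = len(img), len(img[0])
    cost = [[0] * cols for _ in range(rows)]
    back = [[0] * cols for _ in range(rows)]

    # Cost of starting in each column of the top row.
    for c in range(cols):
        cost[0][c] = INK_COST if img[0][c] else BG_COST

    # Dynamic programming over rows: each cell extends the cheapest of its
    # up-left, up, and up-right neighbours (ties prefer the straight move).
    for r in range(1, rows):
        for c in range(cols):
            candidates = range(max(0, c - 1), min(cols, c + 2))
            prev = min(candidates, key=lambda k: (cost[r - 1][k], abs(k - c)))
            back[r][c] = prev
            step = INK_COST if img[r][c] else BG_COST
            cost[r][c] = cost[r - 1][prev] + step

    # Backtrack from the cheapest cell in the bottom row.
    end = min(range(cols), key=lambda k: cost[rows - 1][k])
    path = [end]
    for r in range(rows - 1, 0, -1):
        path.append(back[r][path[-1]])
    path.reverse()
    return cost[rows - 1][end], path
```

For two blobs joined by a one-pixel bridge, such as `[[1,1,0,1,1],[1,1,0,1,1],[1,1,1,1,1],[1,1,0,1,1]]`, the path found this way runs down the gap in column 2 and crosses ink only at the bridge. In practice the crop would be limited to the neighbourhood of the touching point, since an unrestricted path would simply follow an empty margin.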
Keywords/Search Tags: Tibetan historical documents, benchmarking database, touching character string, convolutional neural network