| Character Recognition (CR) is a method that uses computer equipment to automatically convert human-readable text or image information into computer code that can be read, queried, and edited by the computer. The Tibetan language is an important carrier of Tibetan culture. Historically, the number of books published in Tibetan is second only to the number published in Chinese. The Tibetan language is a treasure of Chinese culture and has important value for humanities research and applications. Tibetan text recognition is an important research topic in Tibetan computational linguistics, involving information science, mathematics, linguistics, cognitive science, and other fields, and it is also a goal of artificial intelligence. Therefore, the use of Tibetan character recognition to protect and use ancient Tibetan documents has become a research hotspot in the digitization of document resources. However, because technologies such as nondestructive collection, layout analysis, and character recognition for ancient Tibetan books and documents are immature, many precious and easily damaged paper documents cannot be collected and digitized. The digitized resources of ancient Tibetan books are also mainly image data, so there is a lack of data support for document content mining, knowledge base construction, and research and development of retrieval technology. Woodcut texts account for the largest share of ancient Tibetan books. Therefore, research on text recognition for ancient Tibetan books is particularly important. Some institutions and companies have developed Optical Character Recognition (OCR) systems to recognize printed Tibetan text. According to the available literature, there is very little research on text recognition for woodcut ancient Tibetan books. Judging from the existing literature, domestic and international Tibetan text recognition technology still relies on traditional methods, which cannot integrate the Tibetan language structure 
and text composition rules. The recognition accuracy is low, and the generalization ability is weak. To protect and use Tibetan documents, it is urgent to develop digitization technology with a high recognition rate, high accuracy, and high performance. Deep learning technology has developed rapidly in recent years. Compared with traditional recognition methods, recognition methods based on deep learning can achieve better performance. How to implement end-to-end learning and how to reduce steps based on manual rules have become hot topics of current research. This dissertation studies deep learning-based recognition of ancient Tibetan woodcut text and proposes a new recognition method based on deep neural networks. The main work is summarized as follows: (1) By analyzing traditional text detection methods and taking into account the complex layout features of woodcut ancient Tibetan books, this dissertation implements a CTPN-based detection method for woodcut text in ancient Tibetan books. This method focuses on a CTPN-based text localization algorithm to achieve vertical and horizontal detection of text in woodcut ancient Tibetan books. (2) Using a sliding window-based text line splitting technique for ancient Tibetan books, each ultra-long text line is dynamically split into multiple sub-strings for recognition, and the overlapping characters of adjacent sub-strings are processed based on the position information of the recognized characters, which solves the problem of recognizing ultra-long text lines. (3) A robust string recognition model with strong generalization for ancient Tibetan books is constructed based on a residual network and a bidirectional long short-term memory (BiLSTM) recurrent neural network, combined with sample augmentation, to address the difficulties of recognizing ancient texts with poor image quality, severe adhesion between adjacent text, and large overlap between upper and lower lines. (4) A spell check method is used to detect erroneous syllables, and the hidden 
Markov model and the language model are combined to solve the problem of recognizing and correcting similar characters. |
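
The sliding-window splitting and overlap handling described in contribution (2) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names `split_line` and `merge_windows`, the fixed window width, and the rule of cutting each overlap at its midpoint are all assumptions made for illustration, and the recognizer output is assumed to be a list of (character, x-position) pairs in full-line pixel coordinates.

```python
from typing import List, Tuple

Span = Tuple[int, int]      # (start, end) pixel range within the text line
Char = Tuple[str, float]    # (recognized character, x-centre in line coordinates)


def split_line(width: int, window: int, overlap: int) -> List[Span]:
    """Cover [0, width) with windows of at most `window` pixels.

    Consecutive windows share `overlap` pixels (hypothetical fixed-size scheme).
    """
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    spans: List[Span] = []
    start = 0
    while True:
        end = min(start + window, width)
        spans.append((start, end))
        if end >= width:
            break
        start = end - overlap
    return spans


def merge_windows(spans: List[Span], results: List[List[Char]]) -> str:
    """Join per-window recognition results into one string.

    Each overlap region is cut at its midpoint: a character is kept only by
    the window whose side of the cut contains its x-position, so characters
    recognized twice in an overlap appear once in the output.
    """
    merged: List[str] = []
    last = len(spans) - 1
    for i, ((start, end), chars) in enumerate(zip(spans, results)):
        lo = (start + spans[i - 1][1]) / 2 if i > 0 else float("-inf")
        hi = (spans[i + 1][0] + end) / 2 if i < last else float("inf")
        merged.extend(c for c, x in chars if lo <= x < hi)
    return "".join(merged)


# Toy usage: Latin letters stand in for recognized Tibetan characters.
spans = split_line(width=1000, window=400, overlap=100)
results = [
    [("a", 50), ("b", 200), ("c", 340), ("d", 380)],
    [("c", 340), ("d", 380), ("e", 500), ("f", 660)],
    [("f", 660), ("g", 900)],
]
text = merge_windows(spans, results)
```

Cutting at the overlap midpoint is only one possible policy; a position-aware scheme could instead prefer the window in which a character lies farther from the window edge, where recognition is typically more reliable.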