| The vast collection of historical documents and books is a rare treasure of modern human civilization and an important reference for understanding the lifestyles, science, technology, and cultural development of the past. Among these documents, Tibetan historical documents are a precious historical and cultural heritage created by the Tibetan people. However, because relatively few studies have addressed the images of these documents in recent years, their digitization has encountered great difficulties. Text line segmentation is an important part of digitizing historical documents and has a crucial influence on the accuracy of character recognition. However, most existing research results on text line segmentation cover only a few widely used languages, such as Chinese, English, and other Latin-script languages, and research on segmenting Tibetan historical documents is scarce. To address these problems, this paper studies text line segmentation in Tibetan historical documents and proposes two segmentation methods based on the Tibetan syllable baseline. These methods not only adapt to curved and oblique text lines but also deal effectively with overlapping and touching strokes between adjacent text lines. Experimental results show the effectiveness of the proposed methods. The main contents of this paper are as follows. First, this paper summarizes the research background of text line segmentation for historical document images and briefly reviews the types of existing segmentation methods. For Tibetan documents, this paper introduces the structure of the Tibetan syllable and analyzes the Tibetan strokes and the other main factors that affect segmentation. Drawing on the existing literature, this paper gives the evaluation criteria for Tibetan text line segmentation 
algorithms. Second, this paper introduces two baseline-based methods for text line segmentation in Tibetan historical documents. The first determines the number of text lines and the starting position of each baseline by template matching, and then estimates each baseline by dynamically tracking points. The second extracts the baseline of each Tibetan syllable with the Sobel operator and connects these baselines from left to right into text line baselines. Finally, the connected components between two baselines are analyzed to determine the line cutting points and complete the segmentation. Experimental results show that, compared with the projection-based segmentation method, the proposed methods clearly improve segmentation accuracy. Third, this paper presents a graph-model-based method for text line segmentation of Tibetan historical documents. The method skeletonizes the document image and constructs a graph model, computes the start node and end node of each text line from the graph, and then uses the A* algorithm to find the shortest path from the start node to the end node, which serves as the final text line segmentation path. Experiments show that, compared with the projection-based method, the proposed method not only greatly improves segmentation accuracy but also handles curved and inclined Tibetan text lines more accurately, and some touching strokes are also cut correctly. Finally, this paper develops an end-to-end Tibetan document image digitization system. The system uses deep learning to train a recognition model that converts the collected images of Tibetan documents directly into Tibetan text. During conversion, the system also provides a user interface for manually correcting segmentation errors. |
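
To make the shortest-path idea concrete, the following is a minimal Python sketch of A* used to trace a cut between two text lines. It is not the method described above: the paper builds its graph from a skeletonized document, whereas this sketch assumes a plain pixel grid in which every pixel is a node. The function name `astar_seam`, the `ink_cost` penalty, the allowed moves (right, up, down), and the toy `page` array are all hypothetical choices made for illustration.

```python
import heapq


def astar_seam(ink, start_row, end_row, ink_cost=50):
    """Trace a left-to-right cut through a binary image with A*.

    ink: list of rows, each a list of 0 (background) or 1 (ink).
    Returns a list of (row, col) nodes from column 0 to the last column,
    or None if the goal cannot be reached.
    """
    height, width = len(ink), len(ink[0])
    start, goal = (start_row, 0), (end_row, width - 1)

    def heuristic(node):
        # Manhattan distance: every move costs at least 1 and shifts
        # the position by one cell, so this never overestimates.
        return abs(node[0] - goal[0]) + abs(node[1] - goal[1])

    best = {start: 0}   # cheapest known cost to reach each node
    parent = {}         # back-pointers for path reconstruction
    frontier = [(heuristic(start), 0, start)]
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node == goal:
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return path[::-1]
        if cost > best[node]:
            continue  # stale queue entry, a cheaper route was found later
        r, c = node
        for dr, dc in ((0, 1), (-1, 0), (1, 0)):  # right, up, down
            nr, nc = r + dr, c + dc
            if not (0 <= nr < height and 0 <= nc < width):
                continue
            # Assumed cost model: 1 per step, plus a penalty for crossing ink.
            new_cost = cost + 1 + ink_cost * ink[nr][nc]
            if new_cost < best.get((nr, nc), float("inf")):
                best[(nr, nc)] = new_cost
                parent[(nr, nc)] = (r, c)
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic((nr, nc)), new_cost, (nr, nc)),
                )
    return None


# Toy page: two text lines with one stroke dipping into the gap between them.
page = [
    [1, 1, 1, 1, 1, 1],  # upper text line
    [0, 0, 0, 1, 0, 0],  # a stroke hanging down from the upper line
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # lower text line
]
seam = astar_seam(page, start_row=1, end_row=1)
```

In this toy case the cut detours below the hanging stroke rather than slicing through it, because crossing ink is priced far above a two-pixel detour. Lowering `ink_cost` would let the path cut through touching strokes when no ink-free route exists at a reasonable length.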