
Research Of Chinese Handwritten Text Segmentation Algorithm

Posted on: 2010-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Qu
Full Text: PDF
GTID: 2178360272996393
Subject: Communication and Information System
Abstract/Summary:
1. Introduction

In recent years OCR systems have developed rapidly, so that characters on paper can easily be entered into a computer and processed quickly and efficiently. OCR technology has been studied abroad for decades, progressing from individual letters to words and then to full text, and from printed text to handwritten text. Chinese OCR has also developed rapidly in China. The recognition rate for printed Chinese characters is quite high, and Han Wang has successfully developed several products. Software for recognizing handwritten Chinese text, however, still does not handle the problem well, and its recognition rate remains low.

OCR is a complicated process, and many factors influence the recognition rate. Early work tried to raise the recognition rate by optimizing the classifier, on the premise that the input is a single character, whether printed or handwritten. Because classifier performance has improved considerably, the recognition rate for single characters is now high enough for commercial use. As demand for handwritten text recognition grows, raising the recognition rate of the whole OCR system becomes very important.

Unlike OCR for printed text, which focuses on the classifier, research on OCR for handwritten text concentrates on character segmentation. Statistical analysis shows that segmentation errors cause more mistakes than the classifier does. This follows from the nature of handwritten text: it is more irregular and its lines are not horizontal; in addition, handwritten Chinese characters are more likely to overlap, and the gaps between characters are smaller. These properties are what make handwritten Chinese text difficult to segment.

The hidden Markov model is a statistical model developed from the Markov model, and the Viterbi algorithm is an important algorithm for solving hidden Markov model problems.
A genetic algorithm is an efficient, parallel, self-adapting search algorithm that provides a general framework for complicated optimization problems and is highly robust. Both algorithms are valuable in character segmentation.

2. Research Content

This paper studies the character segmentation stage of an OCR system. In an OCR system the text must first be segmented into single characters. Researchers have proposed many algorithms in recent years, some of them very good. Building on that work, this paper proposes its own ideas, and experiments show that the resulting algorithms are effective. The main research work is as follows:

(1) Analysis and choice of a smoothing filter. The most important preprocessing task is noise removal. Many smoothing filters exist, each with different properties. Spatial filtering is fast, and a mean filter is well suited to extracting large targets, so a mean filter is chosen. The mask size is chosen according to the image: an isolated point smaller than the stroke width is treated as noise, so the mask size is determined by the estimated stroke width, which is obtained from horizontal and vertical scans.

(2) Introduction of image binarization methods, including the double-peak (bimodal) method, the Otsu method, the iterative method, and the minimum error method. After analysis of the images, the Otsu method is chosen for binarization.

(3) Analysis and comparison of the features and limits of common text line extraction methods. The direct projection method has a narrow range of application and cannot handle overlapping lines. The indirect projection method works well for globally skewed text: after skew correction, projection can extract the lines. However, when only some lines are not horizontal and the gaps between lines are small, projection methods perform poorly. This paper proposes a multi-step search method for line extraction that handles this problem well. In a test on 50 handwritten texts, the correct segmentation rate is 95.8%.

(4) Use of an improved Viterbi algorithm to extract non-touching characters; this method is very useful for overlapping characters. When the width of a Chinese character can be estimated, a width criterion is used to identify touching components. Stroke analysis then locates the touching point and is combined with the Viterbi algorithm to produce nonlinear segmentation paths, which are added to the original paths to form the final set of paths. Tests show that the combination of the Viterbi algorithm and stroke analysis is very effective.

(5) Study of the ideas and theory of genetic algorithms and introduction of their operators: selection, crossover, and mutation. Fitness assignment schemes, selection methods, and the principles of crossover and recombination are compared, along with the influence of different kinds of mutation on the algorithm.

(6) Proposal of the concept of character forming probability: the probability that a component delimited by segmentation paths is a Chinese character. The paper defines the character forming probability of a component as the product of three probabilities, which are determined by the difference between the component width and the average width, the internal characteristic distance, and the external characteristic distance.

(7) Design of a genetic algorithm that optimizes the candidate paths, using the average character forming probability as the fitness function. Genes are logically coded to mark whether each candidate path is preserved or deleted. Parameters are selected to avoid poor convergence and local maxima. Tests show that the genetic algorithm runs efficiently and searches correctly. The accuracy of character segmentation is 94.5% by global statistics.

3. Conclusion

Line extraction and character segmentation are the main steps in this paper. The proposed multi-step search algorithm for nonlinear line extraction is simple and accurate, and it overcomes some weaknesses of the direct and indirect projection methods; its segmentation accuracy is 95.8%. Character segmentation has three steps: non-touching character segmentation, touching character segmentation, and candidate path optimization. These use the Viterbi algorithm, stroke analysis combined with the Viterbi algorithm, and a genetic algorithm, respectively. The algorithm is effective, and the global accuracy of candidate path optimization is 94.5%.
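
As an illustration of the binarization choice in item (2), the following is a minimal sketch of Otsu thresholding in plain Python. It is not taken from the thesis. The function names `otsu_threshold` and `binarize`, the 256-level grayscale input, and the convention that dark pixels are ink are all assumptions made for the example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    pixels is a flat list of integer gray levels in [0, levels).
    Pixels <= t form the dark class, pixels > t the bright class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the dark class
    sum0 = 0  # sum of gray levels in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break     # bright class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor 1 / total**2.
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, t):
    """Map a grayscale image (list of rows) to 1 = ink, 0 = background.

    Assumes dark ink on a bright page, so pixels <= t become ink.
    """
    return [[1 if p <= t else 0 for p in row] for row in image]
```

Because the threshold is computed from the histogram alone, the same function can be applied to a whole page or to a single text line.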
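
The nonlinear segmentation path in item (4) can be illustrated with a Viterbi-style dynamic program. This is a hedged sketch only: the abstract does not give the details of the improved Viterbi algorithm, so the state space (one state per column, one step per row), the cost values `ink_cost` and `shift_cost`, and the function name `segmentation_path` are assumptions chosen for the example.

```python
def segmentation_path(img, col_lo, col_hi, ink_cost=10.0, shift_cost=1.0):
    """Find a cheap top-to-bottom cut within columns col_lo..col_hi.

    img is a binary image as a list of rows (1 = ink, 0 = background).
    The cut moves down one row per step and may shift by at most one
    column. Crossing ink and shifting sideways both add cost.
    Returns (total_cost, path) where path[r] is the cut column in row r.
    """
    cols = list(range(col_lo, col_hi + 1))
    # cost[c]: cheapest cost of any partial path ending at column c.
    cost = {c: ink_cost * img[0][c] for c in cols}
    back = []  # back[r - 1][c]: best predecessor column for (row r, col c)

    for r in range(1, len(img)):
        new_cost, pointers = {}, {}
        for c in cols:
            best, best_prev = float("inf"), None
            for p in (c - 1, c, c + 1):
                if p in cost:
                    cand = cost[p] + (shift_cost if p != c else 0.0)
                    if cand < best:
                        best, best_prev = cand, p
            new_cost[c] = best + ink_cost * img[r][c]
            pointers[c] = best_prev
        cost = new_cost
        back.append(pointers)

    # Backtrack from the cheapest end column, as in Viterbi decoding.
    end = min(cols, key=lambda c: cost[c])
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return cost[end], path
```

In the small image below, no straight vertical cut avoids the ink, but a bent path does:

```python
img = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 1, 1],
    [1, 0, 1, 1, 1],
]
total, path = segmentation_path(img, 1, 3)
```

Here the path bends once to the left and crosses no ink, which is the kind of cut a vertical projection cannot produce for overlapping characters.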
Keywords/Search Tags: Chinese handwritten text, character segmentation, digital image processing, genetic algorithms