| Text is a ubiquitous pattern in natural scenes. Many objects in natural scenes, e.g. street signs, shop signs, billboards, road signs, traffic signs, license plates, and posters, contain large amounts of text. This text carries rich and accurate high-level semantic information about the scene, which plays an important role in scene understanding. This paper studies the problem of natural scene text extraction, including text detection and recognition. Text in natural scenes varies widely in font, scale, and layout, and the background is often complex. Scene images also suffer from low resolution, blur, uneven lighting, shadows, and occlusion. These problems make scene text extraction challenging. In view of the complexity of text patterns in natural scenes, we abandon rule-based methods and instead use deep neural networks for their superior representation ability. We design end-to-end machine learning models to efficiently detect and recognize complex scene text, with the goal of improving the representation ability and robustness of the models. Scene text, in the form of images, has a static nature; at the same time, as a character sequence, it has a dynamic nature. This study mainly uses two kinds of deep neural networks, the convolutional neural network (CNN) and the recurrent neural network (RNN), which model these different aspects of scene text, and we use them to build our end-to-end models. Specifically, this work makes the following three contributions: 1. We propose a multi-scale, multi-direction natural scene text detection model, the Vertically Regressed Proposal Network (VRPN). The method uses a deep convolutional neural network for image feature extraction and is motivated by the idea of multi-task learning. We handle text location identification and text region regression at the same time, within a single end-to-end training procedure. We
introduce anchors of multiple sizes to detect characters at multiple scales. Text lines are detected by connecting box proposals. We then obtain the direction of each text line through linear regression, which allows skewed text to be detected. Our model achieves good performance on several public datasets and can detect multilingual scene text. 2. We combine a probabilistic graphical model with deep neural networks and propose a scene text recognition model, the hybrid CNN-HMM model. We formalize scene text recognition as sequence recognition and use a hidden Markov model (HMM) to handle the text sequence dynamics that static convolutional networks cannot capture. Because convolutional neural networks can only handle fixed-size inputs, we use the HMM to model the text sequence. Our model exploits the strong representation ability of the CNN and handles sequences naturally. Training and recognition do not require text segmentation, which greatly improves the robustness of the model. Compared with traditional methods based on hand-crafted features and shallow models, the combination of a deep neural network and an HMM greatly improves recognition performance. 3. We combine a convolutional neural network and a recurrent neural network to form an end-to-end text recognition model, the CNN-LSTM-CTC model. Training the hybrid CNN-HMM model is an iterative procedure that alternately trains the two models. Because the two models follow different paradigms, the hybrid model cannot be trained fully end to end. A recurrent neural network, as a powerful sequential model, can be combined with a convolutional neural network, which only handles static images. Convolutional neural networks perform well on static classification problems, while recurrent neural networks are flexible in modeling dynamics. Combining them lets the two models complement each other's modeling abilities in text recognition. By using connectionist temporal
classification (CTC) as the output layer, we can carry out end-to-end joint training. The training process requires no character segmentation. End-to-end joint training makes fuller use of the modeling capacity. The model performs much better than the hybrid CNN-HMM model and shows superior performance on a number of natural scene text datasets. |
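
To illustrate why a CTC output layer removes the need for character segmentation, the sketch below shows greedy (best-path) CTC decoding: each frame of the sequence is labeled independently, consecutive repeats are merged, and blank labels are dropped. This is only a minimal illustration under stated assumptions, not the decoding procedure used in this work. The function name `ctc_greedy_decode`, the choice of index 0 for the blank label, and the toy alphabet are all hypothetical.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical convention)


def ctc_greedy_decode(frame_scores, alphabet):
    """Best-path CTC decoding: per-frame argmax, merge repeats, drop blanks.

    frame_scores: list of per-frame score lists, one score per label
                  (label 0 is the blank, label k maps to alphabet[k - 1]).
    alphabet:     string of the non-blank characters.
    """
    # Pick the highest-scoring label independently for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Merge consecutive identical labels into one.
    collapsed = [label for label, _ in groupby(best_path)]
    # Remove blanks and map the remaining labels to characters.
    return "".join(alphabet[label - 1] for label in collapsed if label != BLANK)


# Toy example: 7 frames over the labels {blank, 'a', 'b'}.
frames = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank (separates the two 'a' characters)
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.20, 0.70],  # b
    [0.10, 0.10, 0.80],  # b (repeat, merged)
    [0.90, 0.05, 0.05],  # blank
]
decoded = ctc_greedy_decode(frames, "ab")
```

In this toy input the blank between the second and third frames is what allows the repeated character to survive, so the frames decode to `aab` without any explicit character boundaries being supplied.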