| Continuous progress in computer technology and the explosive growth of data of all kinds have driven the rapid development of artificial intelligence. Natural scene images contain a large amount of text that carries rich information, and how to extract this text effectively has become a research hot spot. The purpose of natural scene text detection is to locate text instances in a natural image. Natural scene text recognition is the downstream task that follows detection: it identifies the text content of the candidate regions output by the detection module and outputs the corresponding strings. Natural scene text detection and recognition technology plays an important role in helping computers process text image data for humans, and is widely used in fields such as automatic navigation, intelligent information entry, scene recognition and multimedia retrieval, so it has broad application prospects. However, text images in natural scenes differ from those in traditional OCR (Optical Character Recognition) applications and pose the following difficulties: text in natural scenes is no longer uniform and varies in angle, size, color and font shape; the background of natural scenes is very complex, and some background objects, such as fences and flagpoles, closely resemble text regions; and the image quality in practical applications is unstable, being affected by factors such as the shooting equipment, the skill of the photographer and the lighting conditions, so that natural scene text images are often blurry and text lines are partially occluded. These characteristics seriously reduce the accuracy of natural scene text detection and recognition algorithms, because practical applications cannot guarantee stable, high-quality, high-resolution images with little background interference. To address these problems, this thesis improves the natural scene text detection module and the recognition module separately. The details of the research are as follows. For natural scene text detection, a detection model is designed that uses an instance segmentation approach and is based on an attention mechanism. The model uses a Swin-Transformer + FPN structure as the feature extraction module. Swin-Transformer partitions the input image into patches, computes self-attention within each local window, exchanges information between windows and obtains global image information through the shifted-window operation, and then downsamples through four successive stages to output four feature maps of different sizes. Upsampling and feature fusion are then applied to these feature maps to fuse the multi-scale features fully, so that shallow features also carry deep, high-level semantic information; this allows large-scale feature maps to predict small targets and small-scale feature maps to predict large targets. Subsequently, the four feature maps are fused by the function C(·) to obtain the 1024-channel feature map F. F is fed into the progressive scale expansion module, which draws on the scale increment method of PSENet to solve the problem that segmentation methods have difficulty separating text instances that are close to each other. For natural scene text recognition, this thesis builds on the classical CRNN algorithm and improves the recognition of text with irregular shapes, such as curved and skewed text, in natural scenes. A trainable text rectification module (a TPS transformation network) is added to the front end of the CRNN architecture to correct the shape of irregularly shaped input text images. In the backbone network, the residual network ResNet-50 replaces the VGG16 model used in the original CRNN for feature extraction; it learns the deep semantic information of the input image and overcomes the problems of vanishing gradients, network performance degradation and high computational complexity associated with VGG16. A two-layer bidirectional LSTM forms the recurrent layer of the CRNN model, which does not independently predict the next … Finally, in the transcription layer, which is built around the CTC algorithm, the sequence encoding information is translated by CTC to obtain the final recognition result. This thesis verifies the effectiveness of the proposed model by comparing its performance with published methods on several publicly available datasets. The comparison results show that the proposed model achieves higher accuracy than the other models on the different test sets, indicating that it is effective and widely applicable for text detection and recognition in a variety of natural scenes. |
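
Two of the post-processing steps named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the thesis implementation: the function names `progressive_scale_expansion` and `ctc_greedy_decode`, the list-of-lists inputs, the 4-connectivity, and the choice of greedy (best-path) decoding are all assumptions made for illustration. The first function grows labels from the smallest predicted kernel into successively larger ones, in the spirit of the PSENet expansion step, so that instances merged in the largest mask stay separate. The second collapses repeated labels and removes blanks, which is the simplest way to turn per-frame CTC outputs into a string.

```python
from collections import deque

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connectivity (assumption)


def _flood(labels, mask, queue):
    """Breadth-first growth of existing labels into unlabelled mask pixels."""
    h, w = len(mask), len(mask[0])
    while queue:
        cy, cx = queue.popleft()
        for dy, dx in NEIGHBOURS:
            ny, nx = cy + dy, cx + dx
            if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and labels[ny][nx] == 0:
                labels[ny][nx] = labels[cy][cx]  # first label to arrive wins
                queue.append((ny, nx))


def progressive_scale_expansion(kernels):
    """kernels: list of 2D 0/1 grids of equal size, smallest kernel first."""
    h, w = len(kernels[0]), len(kernels[0][0])
    labels = [[0] * w for _ in range(h)]
    # Step 1: label connected components of the smallest kernel.
    next_label = 0
    for y in range(h):
        for x in range(w):
            if kernels[0][y][x] and labels[y][x] == 0:
                next_label += 1
                labels[y][x] = next_label
                _flood(labels, kernels[0], deque([(y, x)]))
    # Step 2: expand all labels simultaneously into each larger kernel.
    for kernel in kernels[1:]:
        seeds = deque((y, x) for y in range(h) for x in range(w) if labels[y][x])
        _flood(labels, kernel, seeds)
    return labels


def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """frame_scores: one score list per time step; class k > 0 maps to alphabet[k - 1]."""
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    chars, prev = [], blank
    for idx in best_path:
        if idx != prev and idx != blank:  # merge repeats, then drop blanks
            chars.append(alphabet[idx - 1])
        prev = idx
    return "".join(chars)
```

In this sketch two seeds that lie inside one connected region of the largest kernel keep distinct labels, because each pixel is claimed by whichever label reaches it first. In the decoder, a blank between two identical labels is what allows a doubled character to survive the merge.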