| In recent years, driven by the continuous development of deep learning technology and the rich high-level semantic information carried by text itself, deep learning-based scene text detection and recognition has become an important part of daily life, medicine, education, search, finance, translation, and unmanned vehicles, among other areas. It is one of the most valued and promising application areas in computer vision. The primary distinction between text recognition in natural scenes and text recognition in documents is that text images collected in natural scenes frequently have complex backgrounds and differ significantly in style, aspect ratio, orientation, alignment, and stacking. In addition, scene text is shaped by human factors, which introduces randomness into natural scene text images. Text detection and recognition in natural scenes is therefore quite difficult because of factors such as the quantity of text samples, shooting angles, illumination, occlusion, and ambiguities caused by text spacing. To overcome the challenges of detecting and recognizing arbitrarily shaped text in natural scenes, this thesis conducts extensive research on deep learning-based detection and recognition of arbitrarily shaped scene text. It focuses on how to build more accurate text detection algorithms and more efficient, lightweight, and accurate text recognition models for complex scenes. The research efforts and contributions of this study are as follows. (1) A multi-scale residual orthogonal-channel attention scene text detection network is presented to address the low differentiation between text and non-text regions in complicated natural scenes, which causes false detections and missed detections that reduce detection accuracy. The network first employs ResNet50 as its backbone and then redesigns and builds a multi-scale pyramid network that focuses on fine detail. In turn, this makes it possible for 
later classification and regression networks to predict text attributes on multi-scale feature maps with more detailed features. Second, we design a feature enhancement module that aggregates local and global features to address the lack of distinction between text and non-text regions in natural scene text images. To strengthen text-region features at the spatial and channel levels while weakening non-text-region features, the local enhancement module uses residual orthogonal attention and residual channel attention. The global feature module establishes long-distance connections between characters and reduces the impact of irrelevant elements. To help the residual orthogonal attention mechanism produce stronger attention weights, we also design a loss function for it. Comprehensive experiments show that the proposed method performs best on the CTW1500 dataset and achieves good results on the other two datasets, demonstrating the method’s efficacy. (2) This thesis examines the leading scene text recognition algorithms of recent years and finds that most of them use only one-sided information, either local or global, and that their parameter counts and inference speeds make it hard for them to meet the requirements of real applications. This work concludes that, to fully characterize text features and increase recognition accuracy, scene text recognition must rely on both local and global information. In light of these findings, this thesis proposes a Transformer-based scene text recognition algorithm. First, perspective-distorted text images are rectified using a text image rectification network. Second, to fully capture the local detail features of character components and to model the long-range dependencies between characters, we design a hybrid 
module that integrates local and global information. Finally, to improve the performance of the feedforward network, we build a bidimensional dependency-enhanced feedforward network that explicitly models both spatial and channel dependencies. We ran comprehensive experiments on multiple English and Chinese datasets to evaluate the method. The experimental results demonstrate that, while holding significant advantages in parameter count and speed, our algorithm achieves the best performance on the Chinese dataset and results comparable to those of the current best method on the English dataset. |
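
The abstract does not specify how the residual channel attention in contribution (1) is built, so the following is only a minimal sketch of one common form: a squeeze-and-excitation style channel gate with a residual connection. The function name `residual_channel_attention`, the weight matrices `w_reduce` and `w_expand`, and the reduction ratio `r` are illustrative assumptions, not the thesis's actual design.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def residual_channel_attention(features, w_reduce, w_expand):
    """Generic channel attention with a residual connection (assumed form).

    features: array of shape (C, H, W)
    w_reduce: array of shape (C // r, C), hypothetical bottleneck weights
    w_expand: array of shape (C, C // r), hypothetical expansion weights
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = features.mean(axis=(1, 2))            # shape (C,)
    # Excite: bottleneck with ReLU, then a sigmoid gate in (0, 1).
    hidden = np.maximum(w_reduce @ squeezed, 0.0)    # shape (C // r,)
    weights = sigmoid(w_expand @ hidden)             # shape (C,)
    # Reweight each channel, then add the input back (residual path).
    return features + features * weights[:, None, None]


rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = residual_channel_attention(x, w1, w2)
print(y.shape)
```

Because the gate lies in (0, 1), the residual form scales each channel by a factor between 1 and 2, so informative channels are amplified without any channel being suppressed to zero.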