| Natural scene text detection is an important task in computer vision that aims to discover and locate text regions in natural scene images and to output their location and shape. Deep learning based methods for scene text detection have made significant progress, and as research continues, the focus has shifted from horizontal or multi-directional text detection to arbitrary-shaped text detection. However, because arbitrary-shaped text varies drastically in font, size, color, and direction, detection results are still unsatisfactory. There are currently two main challenges in arbitrary-shaped text detection. The first is designing a text instance representation that allows the model to effectively learn the geometric variations of different texts. Existing methods mainly model text instances by regressing the mask or the contour point sequence of text regions. However, for masks it is difficult to balance training complexity against modeling quality, and a limited number of contour points is insufficient to capture the contour details of complex text. The second challenge is designing a concise and efficient model that accurately learns the text instance representation without post-processing, because the learning ability of existing models is unsatisfactory. To address these two challenges, this work proposes two scene text detection models with the following innovations: (1) Because current text instance representations have difficulty fitting extremely long or curved text accurately, this work proposes a text mask representation based on the discrete cosine transform. The method uses the low-frequency components of the discrete cosine transform to represent the text mask, which lowers training complexity and improves representation quality. Furthermore, to address the sample imbalance in current regression methods based on the divide-and-conquer strategy, this work proposes a
single-level prediction framework. A feature-aware module is designed to obtain rich contextual information and adaptively adjust the receptive field, achieving spatial and scale awareness. Additionally, a text kernel sampling strategy is introduced to adaptively adjust the number of positive samples, balancing text regression across scales in the single-level prediction process. (2) Because a single regression step has difficulty perceiving the entire appearance of complex text, this work proposes a text detection method based on multi-stage contour optimization. A transformer-based contour optimization module is designed to correct large-scale contour prediction errors by efficiently obtaining global information, and an accurate text contour representation is achieved by cascading multiple contour optimization modules. To address the error accumulation in current multi-stage methods, an adaptive training strategy is proposed that enhances the correction capability of the contour optimization module by increasing the number of potential learning paths for contour optimization. Furthermore, a re-score mechanism is proposed to evaluate the contour confidence at each stage, which suppresses false positives and raises the classification scores of texts that would otherwise be missed. Experimental results on multiple public datasets, including CTW1500 and ICDAR2015, demonstrate that the two proposed scene text detection models effectively address the main challenges of arbitrary-shaped text detection in natural scenes. |
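
The idea of representing a text mask by the low-frequency components of its discrete cosine transform can be illustrated with a small sketch. This is not the implementation from this work: the function names (`encode_mask`, `decode_mask`, `make_band_mask`), the choice of keeping a square top-left block of coefficients, the 0.5 binarization threshold, and the synthetic curved band are all assumptions made for illustration. The sketch covers only the encoding and decoding of a mask, not the network that would regress the coefficients.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis: row k is the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def encode_mask(mask, keep):
    """2D DCT of a square mask; keep only the top-left keep x keep block
    (assumed here to be the 'low-frequency' part)."""
    c = dct_matrix(len(mask))
    coeffs = matmul(matmul(c, mask), transpose(c))
    return [row[:keep] for row in coeffs[:keep]]

def decode_mask(low, n, threshold=0.5):
    """Zero-pad the kept coefficients, invert the DCT, and binarize."""
    keep = len(low)
    full = [[low[u][v] if u < keep and v < keep else 0.0 for v in range(n)]
            for u in range(n)]
    c = dct_matrix(n)
    recon = matmul(matmul(transpose(c), full), c)
    return [[1 if value >= threshold else 0 for value in row] for row in recon]

def iou(a, b):
    """Intersection over union of two binary masks."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 1.0

def make_band_mask(n):
    """Synthetic curved 'text' band that follows a sine wave (toy data)."""
    half_width = n / 10
    mask = []
    for r in range(n):
        row = []
        for c in range(n):
            center = n / 2 + (n / 4) * math.sin(2 * math.pi * c / n)
            row.append(1 if abs(r - center) <= half_width else 0)
        mask.append(row)
    return mask

if __name__ == "__main__":
    n = 32
    mask = make_band_mask(n)
    for keep in (4, 8, 16, n):
        approx = decode_mask(encode_mask(mask, keep), n)
        print(f"{keep * keep:4d} coefficients -> IoU {iou(mask, approx):.3f}")
```

Running the script prints how closely the decoded mask matches the original as more coefficients are kept. Keeping all `n * n` coefficients reproduces the mask exactly, while smaller blocks trade reconstruction quality for a shorter regression target.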