| With the popularity of low-cost, high-performance mobile, digital, and wearable devices and the rapid development of Internet and mobile Internet technologies, scene text extraction has become increasingly important and is being applied to a growing range of scenarios. As a result, text extraction from natural scene images has recently become a hot research topic in computer vision. Text detection, a key component of text extraction, has likewise attracted increasing attention from researchers. However, the high variability of text objects, together with adverse factors caused by uncontrolled image acquisition environments, such as uneven illumination, occlusion, blur, and perspective distortion, makes text detection a very challenging problem. This dissertation systematically discusses several main open problems in text detection. To address them, it proposes a novel text detection approach based on the so-called color-enhanced CER (Contrasting Extremal Regions) and shallow neural networks. The proposed approach achieves superior performance on two popular benchmark datasets. Moreover, in some application scenarios, user-intention information can be used to simplify the text detection problem; this setting is called user-intention guided text detection. A novel component-tree based approach and an improved version of it are proposed to address this problem. Experimental results demonstrate that the component-tree structure is a very effective image representation for user-intention guided text detection. Accordingly, this dissertation is divided into two parts: fully automatic text detection and user-intention guided text detection. Fully automatic text detection includes two key sub-problems: candidate text connected-component (CC) extraction and text/non-text classification. 
To deal with the former sub-problem, we first point out the limitations of the Extremal Region (ER) method, the most widely used way to extract candidate text CCs, and then propose the color-enhanced CER to partially address them. The latter sub-problem, text/non-text classification, is the bottleneck of text detection. After a detailed discussion of its main difficulties, we identify the ambiguity problem and the class imbalance problem as the major reasons for the poor generalization ability on the text class, which is the minority class in text/non-text classification. To overcome these problems, we consider feature design, system design, and training data preparation together. Instead of the handcrafted features used by previous methods, we propose to use neural networks to learn "meaningful" features directly from the resized raw binary images that correspond to candidate text CCs. This feature design not only avoids the information loss caused by handcrafted features but also has low computational complexity. To overcome the ambiguity problem, we rely on a system design composed mainly of a pre-pruning stage, a candidate text-line generation stage, and a text-line verification stage. In the pre-pruning stage, we use the shape or texture information of isolated candidate text CCs to prune as many unambiguous non-text CCs as possible, which greatly simplifies the subsequent candidate text-line generation. After candidate text-line generation, context information can be used to resolve the ambiguity of isolated candidate text CCs. To simplify the classification problem in the pre-pruning stage, we propose a "divide-and-conquer" strategy based on the specific properties of text objects. 
Each candidate text CC is reliably labeled by rules as one of five types, namely Long, Thin, Fill, Square-large, and Square-small, and is then classified as text or non-text by a corresponding neural network trained with an ambiguity-free learning strategy. This strategy significantly improves the generalization ability on the text class, so that we can safely use as many synthetic positive training samples as possible. Using synthetic data not only reduces the data labeling effort but also keeps the data clean and uniformly distributed, which benefits the trained classifiers. Moreover, the ambiguity-free learning strategy can be used to effectively sample a small subset of non-text training samples from the original problem space, which addresses the data imbalance problem. Thanks to this strategy, even shallow neural networks can achieve performance competitive with deep neural networks on the extracted training data in our task. We can therefore use shallow neural networks as the text/non-text classifiers in our system, which greatly reduces computational complexity. The proposed approach achieves superior performance on both the ICDAR-2011 and ICDAR-2013 benchmark datasets. To address the user-intention guided text detection problem, we propose a novel component-tree based approach and demonstrate that the component tree is a very effective image representation for this problem. Compared with the conventional scan-line based approach, the component-tree based approach achieves much better performance. Moreover, the original component-tree based approach is further improved. |
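
The "divide-and-conquer" pre-pruning step described in the abstract (label each candidate CC by rule as one of five types, then pass it to the classifier for that type) can be illustrated with a minimal sketch. The abstract does not give the labeling rules, so everything specific here is an assumption made for illustration only: the `Component` fields, the use of aspect ratio and fill ratio, the order in which the rules are tested, and all threshold values are hypothetical, and the per-type classifiers are stand-in callables rather than the trained shallow networks.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Component:
    """A candidate text connected component (hypothetical representation)."""
    width: int   # bounding-box width in pixels
    height: int  # bounding-box height in pixels
    area: int    # number of foreground pixels in the component


# Hypothetical thresholds; the abstract does not state the actual rules.
LONG_ASPECT = 2.0   # width/height at or above this -> "Long"
THIN_ASPECT = 0.5   # width/height at or below this -> "Thin"
FILL_RATIO = 0.8    # foreground/bounding-box area at or above this -> "Fill"
LARGE_SIDE = 32     # longer side at or above this -> "Square-large"


def shape_type(cc: Component) -> str:
    """Assign one of the five type labels using simple geometric rules."""
    aspect = cc.width / cc.height
    fill = cc.area / (cc.width * cc.height)
    if aspect >= LONG_ASPECT:
        return "Long"
    if aspect <= THIN_ASPECT:
        return "Thin"
    if fill >= FILL_RATIO:
        return "Fill"
    if max(cc.width, cc.height) >= LARGE_SIDE:
        return "Square-large"
    return "Square-small"


def pre_prune(cc: Component,
              classifiers: Dict[str, Callable[[Component], bool]]) -> bool:
    """Route the component to the classifier for its type.

    Returns True if the component is kept as candidate text.
    """
    return classifiers[shape_type(cc)](cc)
```

In use, `classifiers` would map each of the five labels to its own text/non-text model, so each model only has to handle components of one shape type.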