| With the explosion of the information era, ever more image and video data is being spread in a variety of ways. How to obtain useful information efficiently from such a large volume of data has become a new challenge. Text in natural images is an important source of information that is helpful for many real-world applications, including image retrieval, human-computer interaction, and navigation systems. Scene text reading systems, which consist of text detection and recognition, have been extensively studied in recent years. However, only a small portion of the images in a large volume of data contain text. Therefore, an efficient preprocessing algorithm that quickly determines whether an image contains text is desirable. The main content of this thesis is as follows: To address the lack of an image database for text image discrimination, we have collected a large dataset from the Internet and adopted relevant evaluation metrics. Most of the images are natural images, and a few are born-digital images and document images. Because of the diversity of text in this dataset, including layout, font, color and language type, it can serve as a standard text/non-text image classification benchmark for evaluating different algorithms. Starting from feature encoding, we propose an effective and efficient method named CNN Coding that combines the advantages of three widely used techniques in this field: MSER, CNN and BoW. Results on the collected dataset show that CNN Coding outperforms other typical methods. In addition, the comparison between our method and several typical image classification algorithms helps us further explore the difficulties and requirements of this problem. Considering the demand for high usability, feasibility and low time consumption, we also propose a novel convolutional neural network variant called Multi-scale Spatial Partition Network (MSP-Net) for text/non-text image classification in the wild. 
Through multi-scale spatial partition, the network transforms image-level classification into block-level classification. If any image block is classified as a text block, the whole image is regarded as a text image. Furthermore, MSP-Net predicts the results for all blocks in a single forward propagation, which is very efficient. In both accuracy and speed, MSP-Net outperforms other methods on several datasets, including CNN Coding. This thesis addresses several open problems in text/non-text image classification, such as the benchmark and the metrics. Based on an analysis of the essentials and difficulties of the task, two efficient and effective algorithms are proposed, which can serve as vital tools for mining textual information from a large volume of image or video data. |
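
The block-level decision rule described above can be illustrated with a minimal sketch. It does not reproduce MSP-Net itself. It assumes, for illustration only, that each scale partitions the image into an n x n grid and that the network has already produced one text probability per block. The function names `partition_boxes` and `classify_image`, the grid sizes, the 0.5 threshold, and the example scores are all hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Tuple

# Block-level text probabilities for one scale, laid out as an n x n grid.
Grid = List[List[float]]


def partition_boxes(width: int, height: int, n: int) -> List[Tuple[int, int, int, int]]:
    """Split an image into an n x n grid of (left, top, right, bottom) blocks.

    Assumption: a uniform grid per scale; the actual partition scheme may differ.
    """
    boxes = []
    for row in range(n):
        for col in range(n):
            left = col * width // n
            top = row * height // n
            right = (col + 1) * width // n
            bottom = (row + 1) * height // n
            boxes.append((left, top, right, bottom))
    return boxes


def classify_image(block_probs: Dict[int, Grid], threshold: float = 0.5) -> bool:
    """Return True (text image) if any block at any scale reaches the threshold.

    block_probs maps a grid size n to its n x n grid of block scores, which
    are assumed to come from a single forward pass of some network.
    """
    return any(
        prob >= threshold
        for grid in block_probs.values()
        for row in grid
        for prob in row
    )


if __name__ == "__main__":
    # Hypothetical scores for a 1 x 1 and a 2 x 2 partition of one image.
    scores = {1: [[0.1]], 2: [[0.2, 0.9], [0.1, 0.3]]}
    print(partition_boxes(4, 4, 2))
    print(classify_image(scores))
```

Because the image-level label is the logical OR of the block labels, a single small text region is enough to flag the image, even when the score for the whole image taken as one block is low.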