| ABSTRACT: In recent years, the rise of microblog networks has benefited scientific research by providing research data, and has promoted research in Natural Language Processing, community discovery in complex networks, and other fields. Extracting the content of multimedia data as research material is helpful for those fields. It also improves the ability of computers to process such data automatically, and the technology has considerable potential for commercial use. Optical Character Recognition (OCR), an important branch of pattern recognition, has become a mature field with many applied technologies after years of development. Its research targets have been extended to natural scene text recognition, handwritten text recognition, and related tasks, with notable results. After analyzing the theoretical basis of OCR technology, this paper mainly studied and improved the text detection method. By applying OCR technology to microblog content research, the text content of microblog image samples was extracted and saved so that it could be provided as research data for other fields. To combine the texture feature and the edge feature of an image in text detection, the author proposed a method that uses a Gabor filter bank to transform the raw image. Non-text objects were then discarded using prior knowledge of the text area. Edge detection was performed with the Sobel method. By merging the texture feature and the edge feature and applying morphological operations to the image, the text area was detected. This method improved the accuracy of text detection. In practical use, the method depends little on empirical tuning and adapts well to different inputs. 
When extracting the features of a single character, the author applied multi-scale Gabor filters to the image, constructing a group of vectors that represents the image at multiple scales and in multiple directions, and used an SVM to classify the texture features. Finally, the author applied the method to the microblog network and established a system that fetches images and recognizes the text in them, implementing the image retrieval and OCR functions. The system was verified by experiment. This work has been supported by the National Natural Science Foundation of China under Grants 61172072 and 61271308, the Beijing Natural Science Foundation under Grant 4112045, the Research Fund for the Doctoral Program of Higher Education of China under Grant W11C100030, the Beijing Science and Technology Program under Grant Z121100000312024, and the Beijing Municipal Commission of Education Discipline Construction and Graduate Construction Project. |
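
The text detection steps summarized in the abstract (Gabor filter bank, Sobel edge detection, feature merging, morphological processing) can be illustrated with a minimal sketch. This is not the author's implementation: the function names, kernel size, wavelength, the product used to merge the two feature maps, and the 0.5 threshold are all assumptions chosen for illustration, and the prior-knowledge filtering of non-text objects is omitted. It uses only the Python standard library so that it runs as-is; a practical system would more likely rely on an image processing library.

```python
import math

def gabor_kernel(size, sigma, theta, wavelength):
    """Zero-mean real (cosine) Gabor kernel as a list of rows."""
    half = size // 2
    c, s = math.cos(theta), math.sin(theta)
    rows = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * c + y * s  # coordinate along the wave direction
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        rows.append(row)
    mean = sum(map(sum, rows)) / float(size * size)
    return [[v - mean for v in row] for row in rows]

def filter2d(img, kernel):
    """Same-size correlation of img with kernel, zero-padded at the borders."""
    h, w = len(img), len(img[0])
    half = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr + half][dc + half] * img[rr][cc]
            out[r][c] = acc
    return out

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def texture_map(img, orientations=4, size=5, sigma=2.0, wavelength=4.0):
    """Per-pixel maximum absolute Gabor response over an orientation bank."""
    best = [[0.0] * len(img[0]) for _ in img]
    for i in range(orientations):
        kernel = gabor_kernel(size, sigma, math.pi * i / orientations, wavelength)
        resp = filter2d(img, kernel)
        best = [[max(b, abs(v)) for b, v in zip(brow, vrow)]
                for brow, vrow in zip(best, resp)]
    return best

def edge_map(img):
    """Sobel gradient magnitude."""
    gx, gy = filter2d(img, SOBEL_X), filter2d(img, SOBEL_Y)
    return [[math.hypot(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(gx, gy)]

def morph(mask, combine):
    """3x3 dilation (combine=max) or erosion (combine=min); outside counts as 0."""
    h, w = len(mask), len(mask[0])
    def at(r, c):
        return mask[r][c] if 0 <= r < h and 0 <= c < w else 0
    return [[combine(at(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1))
             for c in range(w)] for r in range(h)]

def detect_text_mask(img, threshold=0.5):
    """Merge texture and edge maps (assumed: product), threshold, then close."""
    tex, edge = texture_map(img), edge_map(img)
    score = [[t * e for t, e in zip(tr, er)] for tr, er in zip(tex, edge)]
    peak = max(map(max, score))
    if peak == 0:
        return [[0] * len(img[0]) for _ in img]
    mask = [[1 if v / peak >= threshold else 0 for v in row] for row in score]
    return morph(morph(mask, max), min)  # closing = dilation then erosion

# Synthetic 20x20 image: a striped, text-like patch on a flat background.
image = [[1.0 if 6 <= r < 14 and 6 <= c < 14 and c % 4 < 2 else 0.0
          for c in range(20)] for r in range(20)]
mask = detect_text_mask(image)
```

Multiplying the two maps keeps only pixels where both texture and edge evidence are present, which is one simple way to read "merging"; a weighted sum or a logical combination of separately thresholded maps would be equally plausible readings of the abstract.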