With the rapid popularization of mobile intelligent devices and the large-scale rise of social networks and video websites, video has become an important information carrier, and the demand for intelligent analysis and processing of video is increasingly urgent. As a high-level feature, text in video directly carries semantic information and often accurately expresses the key information about the video. Text is therefore an important cue for interpreting video content, and an important basis for video retrieval and video content understanding. As a result, video text detection not only has great research value for intelligent video analysis and processing, but also has broad application prospects in intelligent driving, geographical positioning, network security, and so on.

Traditional text detection algorithms target a single image. Because of uneven illumination, low resolution, complex backgrounds, multi-oriented text, and the large number of frames in a video, directly applying existing single-image text detection algorithms to video often yields poor accuracy and slow speed. The prominent characteristic of video is its temporal redundancy. In this paper, we mine this redundant temporal information to meet the above challenges and propose an effective video text detection scheme that improves both detection accuracy and speed. Our main work and contributions are:

1. A fully convolutional neural network for text detection in video frames is designed. The model extracts rich features through a multi-layer neural network and detects not only horizontal text but also multi-oriented text. Experiments demonstrate the validity and generality of the detection model.

2. Detecting a video frame by frame with the text detection model above is computationally inefficient because of the large number of frames. Based on the fact that the content of adjacent video frames changes little, we use optical flow information to accelerate video text detection: feature maps are extracted by the feature extraction network only on key frames, and optical flow is then used to propagate these feature maps to the adjacent frames. Detection speed is accelerated while detection accuracy is maintained, because most of the feature extraction time is saved.

3. Under complex backgrounds, uneven illumination, video blur, and so on, a single-frame text detection model inevitably suffers from missed and false detections. To solve this problem, we exploit the complementary information in adjacent frames to further mine temporal information and fuse the single-frame detection results, thereby correcting false and missed detections and improving detection accuracy.

The proposed algorithm is experimentally validated on two public video text datasets, Minetto and ICDAR 2015. The experiments show that the proposed detection scheme achieves good results in both detection speed and detection accuracy.
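The key-frame feature propagation in contribution 2 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a dense optical-flow field that maps each pixel of the current frame back to its location in the key frame, and warps the key frame's feature map by bilinear sampling (the function and variable names are hypothetical).

```python
import numpy as np

def warp_features(key_feat, flow):
    """Propagate a key-frame feature map to a nearby frame.

    key_feat: (C, H, W) feature map extracted on the key frame.
    flow:     (2, H, W) flow field; for the current frame's pixel
              (x, y), (x + flow[0, y, x], y + flow[1, y, x]) is its
              position in the key frame.  Bilinear sampling with
              border clamping.
    """
    C, H, W = key_feat.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    src_x = np.clip(xs + flow[0], 0, W - 1)
    src_y = np.clip(ys + flow[1], 0, H - 1)

    # Four neighbouring grid points and their bilinear weights.
    x0 = np.floor(src_x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(src_y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = src_x - x0, src_y - y0

    return (key_feat[:, y0, x0] * (1 - wx) * (1 - wy)
          + key_feat[:, y0, x1] * wx * (1 - wy)
          + key_feat[:, y1, x0] * (1 - wx) * wy
          + key_feat[:, y1, x1] * wx * wy)
```

In this scheme only the key frames pass through the (expensive) feature extraction network; non-key frames need only a flow computation plus this warp, which is where the speed-up comes from.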
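One simple way to realize the temporal fusion in contribution 3 is to match boxes across adjacent frames by IoU and keep only detections that are confirmed in enough frames of a small temporal window. The sketch below is an illustrative voting scheme under that assumption, not the exact fusion rule of the paper; thresholds and names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def fuse_detections(frame_boxes, center, radius=1, iou_thr=0.5, min_votes=2):
    """Keep a box detected in frame `center` only if an overlapping box
    (IoU >= iou_thr) appears in at least `min_votes` frames of the
    window [center - radius, center + radius] (the center frame itself
    always contributes one vote).

    frame_boxes: list over frames; each entry is a list of boxes.
    """
    lo = max(0, center - radius)
    hi = min(len(frame_boxes) - 1, center + radius)
    kept = []
    for box in frame_boxes[center]:
        votes = sum(
            any(iou(box, other) >= iou_thr for other in frame_boxes[f])
            for f in range(lo, hi + 1)
        )
        if votes >= min_votes:  # confirmed by enough neighbouring frames
            kept.append(box)
    return kept
```

This direction suppresses false detections; missed detections can be handled symmetrically, by inserting a box into the current frame when its neighbours agree on a detection the current frame lacks, which is omitted here for brevity.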