
Video Captioning Based On Deep Learning And Multi-Feature Fusion

Posted on: 2020-10-15  Degree: Master  Type: Thesis
Country: China  Candidate: H Xu  Full Text: PDF
GTID: 2428330602952265  Subject: Electronic and communication engineering
Abstract/Summary:
In the era of mobile Internet and big data, the popularity of intelligent terminals and the growth of social networks have made multimedia data on the Internet grow explosively. Labeling and describing all of this data by hand has become an impossible task, and describing multimedia data automatically by computer is an inevitable trend. Aiming at the low accuracy of existing video captioning algorithms, this thesis studies video captioning methods based on deep learning and feature fusion, and validates the deep network models in terms of feasibility and reliability. The main contributions are as follows:

(1) Because traditional video captioning methods are not accurate enough to suit large-scale domain-video data sets, researchers have turned to deep learning based video captioning. Chapter 3 proposes an end-to-end video captioning method based on PNASNet two-dimensional image features and 3D spatiotemporal features. First, PNASNet, a convolutional neural network discovered by a progressive neural architecture search algorithm, extracts video image features, and a C3D network extracts video action information. The image and action features are then fused and fed into an encoder-decoder model built from GRUs, yielding an end-to-end video captioning model. Trained on the MSR-VTT domain-video data set and evaluated with BLEU, ROUGE, METEOR, and CIDEr, the model is significantly more accurate than traditional methods. However, the model is too simple: description quality drops when the video scene is complex, so the algorithm's robustness is insufficient.

(2) To address the low robustness of the Chapter 3 model, Chapter 4 proposes a video captioning method based on an attention mechanism, which improves translation quality on complex videos. The chapter also introduces a spatiotemporal feature based on spatial-temporal graph convolution: human skeleton key points are extracted to build a topological graph, and a spatial-temporal graph convolutional network extracts local action information, which is fused with the image features. Finally, an end-to-end video captioning method with attention is constructed; a code sketch of this fusion-plus-attention design follows this paragraph. Experiments show that the motion information obtained by spatial-temporal graph convolution outperforms that of C3D, and the added attention mechanism greatly improves the model's ability to describe complex videos. However, the modified model is far more complex, requiring long data preprocessing and training times. Moreover, Chapters 3 and 4 sample video frames at equal intervals when extracting image features, which easily causes information redundancy and consumes unnecessary computing resources.
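To make the fusion-plus-attention design concrete, here is a minimal PyTorch-style sketch, not the thesis's actual code: per-frame image and motion features are concatenated, fused by a linear layer, encoded with a GRU, and a GRU decoder attends over the encoded frames at each word step. All names and dimensions (4320-d PNASNet features, 4096-d C3D features, 512 hidden units, a 10,000-word vocabulary) are illustrative assumptions.

# Hedged sketch of feature fusion + GRU encoder-decoder with attention.
# Dimensions and layer choices are assumptions, not the thesis's exact setup.
import torch
import torch.nn as nn

class AttnCaptioner(nn.Module):
    def __init__(self, img_dim=4320, motion_dim=4096, hidden=512,
                 vocab=10000, embed=300):
        super().__init__()
        self.fuse = nn.Linear(img_dim + motion_dim, hidden)   # feature fusion
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, embed)
        self.cell = nn.GRUCell(embed + hidden, hidden)
        self.attn = nn.Linear(hidden * 2, 1)                  # additive-style score
        self.out = nn.Linear(hidden, vocab)

    def forward(self, img, motion, captions):
        # img: (B,T,img_dim); motion: (B,T,motion_dim); captions: (B,L)
        enc, h = self.encoder(torch.tanh(self.fuse(torch.cat([img, motion], -1))))
        h = h.squeeze(0)
        logits = []
        for t in range(captions.size(1)):
            # Score every frame against the current decoder state.
            score = self.attn(torch.cat([enc, h.unsqueeze(1).expand_as(enc)], -1))
            ctx = (torch.softmax(score, 1) * enc).sum(1)      # context vector
            h = self.cell(torch.cat([self.embed(captions[:, t]), ctx], -1), h)
            logits.append(self.out(h))
        return torch.stack(logits, 1)                         # (B,L,vocab) word logits

# Toy usage: 2 videos, 16 sampled frames, caption length 8.
model = AttnCaptioner()
y = model(torch.randn(2, 16, 4320), torch.randn(2, 16, 4096),
          torch.randint(0, 10000, (2, 8)))
print(y.shape)  # torch.Size([2, 8, 10000])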
(3) To address the shortcomings of the frame sampling strategy used in Chapters 3 and 4, Chapter 5 introduces a video sampling strategy based on deep reinforcement learning. The reinforcement learning video summarization algorithm builds an encoder-decoder model, trains it with reinforcement learning, and constructs a new summarization reward function so that the model can learn without supervision. Sampling the video with this summarization algorithm reduces the amount of video data while retaining the key information, effectively reducing the computational load of the video captioning model. In addition, because video contains audio as well as image information, this chapter extracts MFCC features from the audio track, uses a convolutional neural network to extract deeper audio features, and fuses them with the image and action features, completing a video captioning method that fuses audio and video features. With the video summarization algorithm and audio information added, the descriptive ability of the model is further improved.
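The audio branch can be sketched in the same spirit. The following is a hypothetical illustration, assuming librosa for MFCC extraction; the 40-coefficient MFCC setting, the 16 kHz sample rate, the file name, and all layer sizes are assumptions chosen for illustration, not taken from the thesis.

# Hedged sketch of the Chapter 5 audio branch: MFCCs from the soundtrack,
# refined by a small CNN into a vector for fusion with image/motion features.
import librosa
import torch
import torch.nn as nn

def mfcc_features(wav_path, n_mfcc=40):
    # Load the audio track and compute an (n_mfcc, frames) MFCC matrix.
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

class AudioCNN(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling -> fixed-size output
        )
        self.fc = nn.Linear(64, out_dim)      # deeper audio feature vector

    def forward(self, mfcc):                  # mfcc: (B, 1, n_mfcc, frames)
        return self.fc(self.conv(mfcc).flatten(1))

# Hypothetical usage ("video.wav" is a placeholder file):
# audio = torch.from_numpy(mfcc_features("video.wav")).float()[None, None]
# audio_vec = AudioCNN()(audio)  # fuse with image/motion features downstream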
Keywords/Search Tags: Deep Learning, Video Analysis, Video Captioning, Feature Fusion