
Research On Multi-Modal Video Captioning

Posted on: 2023-03-23  Degree: Master  Type: Thesis
Country: China  Candidate: L Wang  Full Text: PDF
GTID: 2568306914479834  Subject: Information and Communication Engineering
Abstract/Summary:
Video captioning is a research direction that combines computer vision and natural language processing. Its goal is to take a video as input and output a short English natural-language sentence describing the video's main content. Video captioning plays an important role in intelligent applications such as video retrieval, assistance for the visually impaired, surveillance video analysis, and video classification.

The key to multi-modal video captioning is to extract information of various modalities from the video and then select and fuse it appropriately, so as to improve the performance of the algorithm and the fluency, accuracy, and richness of the generated sentences. Multi-modal feature selection and fusion methods fall into two types: soft fusion and hard fusion. Soft fusion performs a weighted combination of the context features of different modalities through an attention mechanism, whereas hard fusion selects the single most relevant modality's context feature according to the part of speech of each word in the ground-truth sentence. This thesis studies and improves current algorithms with both methods. Its two specific research points are as follows.

1. To address the unclear correlation and constraint relationship between visual and audio modal information, this thesis proposes a video captioning algorithm based on audio-visual feature fusion. The algorithm first uses ResNet152 to extract the visual feature sequence and MFCC and VGGish to extract audio feature sequences, then feeds them into independent LSTM (Long Short-Term Memory) networks for temporal encoding. The encoder outputs are fused in two steps: a temporal attention mechanism first produces a context feature for each modality, and a modal attention mechanism then combines these into a fused audio-visual context feature (an illustrative sketch of this two-step fusion, and of the decoding step, follows below). Finally, the fused context is fed into the decoder together with the word embedding of the previous time step to compute a probability distribution over all words in the dictionary at the current time step, and the word with the highest probability is selected as the output. Experiments on the MSR-VTT dataset verify the performance of the proposed algorithm.

2. Existing work that migrates visual reasoning methods to video captioning uses only a narrow range of modal information, which leads to over-reliance on refined visual features and an unsound reasoning process. To address this, this thesis proposes a video captioning method based on multi-modal feature reasoning. The algorithm first uses 2D CNN and 3D CNN feature extractors to obtain static and dynamic visual features and MFCC to extract audio features, then feeds them into independent bidirectional LSTMs for temporal encoding; the decoder's hidden state is also fed, as one of the inputs, into an independent LSTM for temporal encoding. The algorithm then performs visual reasoning and feature selection on the encoded multi-modal features, and improves the accuracy of reasoning and selection by introducing audio information. Finally, the decoder decodes the selected context feature into an English natural-language sentence. Experiments on the MSR-VTT dataset verify the performance of the proposed algorithm.
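As a concrete illustration of the two-step soft fusion described in research point 1, the following is a minimal PyTorch sketch, assuming additive (Bahdanau-style) attention and that both modalities have already been projected to a common feature dimension. All module names, dimensions, and the attention form are illustrative assumptions, not the thesis's exact implementation.

```python
# Illustrative sketch only: two-step soft fusion (temporal attention per
# modality, then modal attention across modalities). Assumes both modality
# sequences share one feature dimension; not the thesis's exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Additive (Bahdanau-style) attention over a feature sequence."""
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hid = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, steps, feat_dim); hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(self.w_feat(feats) + self.w_hid(hidden).unsqueeze(1)))
        weights = F.softmax(scores, dim=1)       # (batch, steps, 1)
        return (weights * feats).sum(dim=1)      # context: (batch, feat_dim)

class AudioVisualFusion(nn.Module):
    """Temporal attention per modality, then modal attention across the contexts."""
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.temporal_visual = AdditiveAttention(feat_dim, hidden_dim)
        self.temporal_audio = AdditiveAttention(feat_dim, hidden_dim)
        self.modal_attn = AdditiveAttention(feat_dim, hidden_dim)

    def forward(self, visual_seq, audio_seq, decoder_hidden):
        # Step 1: temporal attention yields one context vector per modality.
        c_v = self.temporal_visual(visual_seq, decoder_hidden)
        c_a = self.temporal_audio(audio_seq, decoder_hidden)
        # Step 2: modal attention weights the two contexts into a fused one.
        modal_stack = torch.stack([c_v, c_a], dim=1)   # (batch, 2, feat_dim)
        return self.modal_attn(modal_stack, decoder_hidden)
```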
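The greedy word-selection step described in research point 1 can likewise be sketched as follows, assuming a single-layer LSTMCell decoder; the class name, vocabulary size, and embedding dimension are placeholders rather than the thesis's actual configuration.

```python
# Illustrative sketch only: one greedy decoding step, combining the previous
# word's embedding with the fused audio-visual context feature.
import torch
import torch.nn as nn

class CaptionDecoderStep(nn.Module):
    """One decoding step: previous word embedding + fused context -> next word."""
    def __init__(self, vocab_size: int, embed_dim: int, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, fused_context, state):
        # prev_word: (batch,) word indices; fused_context: (batch, feat_dim)
        x = torch.cat([self.embed(prev_word), fused_context], dim=-1)
        h, c = self.cell(x, state)            # state: (h, c), each (batch, hidden_dim)
        logits = self.out(h)                  # scores over all words in the dictionary
        next_word = logits.argmax(dim=-1)     # greedy: pick the highest-probability word
        return next_word, (h, c)
```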
Keywords/Search Tags: video captioning, multi-modal, attention mechanism, deep learning