
Research On Video Captioning Method Based On Graph Convolution And Self-Attention Mechanism

Posted on: 2023-01-12    Degree: Master    Type: Thesis
Country: China    Candidate: F Liu    Full Text: PDF
GTID: 2568306800466614    Subject: Software engineering
Abstract/Summary:
With the explosive growth of video data, the video captioning task has attracted increasing attention from researchers. Video captioning converts input video into a natural-language description of its content, which has important application value in fields such as video summarization and assistance for the blind. Making a computer understand the content of a video as a human does, and express it accurately in language, remains an unsolved problem. Aiming at the lack of contextual information and the inaccurate descriptions of existing video captioning models, this paper studies three aspects of the encoder-decoder video captioning framework (feature fusion, graph-convolution-based reasoning, and the self-attention mechanism) and proposes corresponding improvements. The main research work and achievements of this paper are as follows:

1. A video captioning method based on high-level semantics and feature fusion is proposed and implemented. The method extracts appearance features with a ResNet, dynamic features with an I3D network, and object features with a Faster R-CNN; a Bi-LSTM encodes high-level semantic information, and an attention mechanism finally fuses the different features into the input sequence of the video captioning model. Experimental results show that the method optimizes the input information of the captioning model and improves the accuracy of the generated caption sentences.

2. A video captioning method based on graph convolution and dynamic reasoning is proposed and implemented. The method learns latent semantic information from video features through a graph convolutional network, and a dynamic inference module uses the different features to dynamically generate visual words. Experimental results show that the method effectively extracts the latent semantics of videos, generates video captions, and alleviates both the cross-modal gap between video and text and the interference of redundant information.

3. A video captioning method based on SA+GRU is proposed and implemented. The method improves model performance through a self-attention mechanism, while a GRU decoder improves computational efficiency. Combined with the methods implemented above, a complete SA+GRU-based video captioning method is realized. Experimental results show that the method improves both the accuracy and the computational efficiency of the model-generated captions.

The research contributions of this paper are: a method based on high-level semantics and feature fusion that improves the quality of the model's input; a method based on graph convolution and dynamic reasoning that lets the model better exploit different features and eliminate the influence of redundant information; and a method based on SA+GRU that improves, to a certain extent, the model's ability to learn contextual information and its computational efficiency, so that it generates more accurate video captions.
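The attention-based fusion in method 1 can be illustrated with a minimal NumPy sketch, not the thesis implementation: each modality (stand-ins for ResNet, I3D, and Faster R-CNN features) contributes one feature row, and a hypothetical bilinear scoring matrix `W` and query vector (both assumptions, with made-up dimensions) produce the attention weights for a convex combination:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(features, query, W):
    """Fuse per-modality feature vectors with attention weights.

    features: (M, d) one row per modality (appearance, motion, object)
    query:    (d,)   decoder-side state used to score each modality
    W:        (d, d) hypothetical bilinear scoring matrix (assumption)
    Returns a single (d,) fused feature vector.
    """
    scores = features @ W @ query      # (M,) relevance of each modality
    weights = softmax(scores)          # normalized attention weights
    return weights @ features          # convex combination of modalities

rng = np.random.default_rng(0)
d = 8                                  # toy feature dimension
feats = rng.normal(size=(3, d))        # 3 modalities: appearance / motion / object
fused = attention_fuse(feats, rng.normal(size=d), np.eye(d))
print(fused.shape)                     # (8,)
```

In a real model the fused vector would be computed per time step and fed to the decoder; here a single static fusion suffices to show the mechanism.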
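The graph-convolution step in method 2 can be sketched as one generic GCN propagation, H' = ReLU(D^(-1/2)(A+I)D^(-1/2) H W), over a toy graph of object nodes; the graph, feature sizes, and weight values are all illustrative assumptions rather than the thesis's exact module:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step with symmetric normalization.

    A: (N, N) adjacency matrix of the object graph (no self-loops)
    H: (N, F) node features
    W: (F, G) learnable weight matrix (here fixed for illustration)
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)         # D^(-1/2)
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)  # ReLU

# toy chain graph of 4 detected objects with 6-d features
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.ones((4, 6))
W = np.full((6, 5), 0.1)
H1 = gcn_layer(A, H, W)
print(H1.shape)                             # (4, 5)
```

Stacking two or three such layers lets each node's representation absorb information from multi-hop neighbors, which is how latent relations between objects can be encoded.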
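Method 3's SA+GRU pipeline can be sketched as scaled dot-product self-attention over encoder states followed by a minimal GRU decoder cell. This is a bare-bones illustration under assumed shapes: the projections in the attention are identity matrices, the GRU omits biases, and all weights are random:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention with identity Q/K/V projections.
    X: (T, d) sequence of encoder states."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                     # (T, T) similarities
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = e / e.sum(axis=1, keepdims=True)        # row-wise softmax
    return weights @ X                                # attended context

def gru_cell(x, h, Wz, Wr, Wh):
    """Minimal bias-free GRU update for one step; weights act on [x; h]."""
    xh = np.concatenate([x, h])
    z = 1 / (1 + np.exp(-(Wz @ xh)))                  # update gate
    r = 1 / (1 + np.exp(-(Wr @ xh)))                  # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h]))
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(1)
T, d, hdim = 5, 8, 6                                  # toy dimensions
ctx = self_attention(rng.normal(size=(T, d)))         # (T, d) contexts
Wz = rng.normal(size=(hdim, d + hdim)) * 0.1
Wr = rng.normal(size=(hdim, d + hdim)) * 0.1
Wh = rng.normal(size=(hdim, d + hdim)) * 0.1
h = np.zeros(hdim)
for t in range(T):                                    # decode step by step
    h = gru_cell(ctx[t], h, Wz, Wr, Wh)
print(h.shape)                                        # (6,)
```

The GRU's two gates are the source of its efficiency relative to an LSTM: one fewer gate and no separate cell state, hence fewer parameters per decoding step.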
Keywords/Search Tags: Video Captioning, High-Level Semantics, Graph Convolution, Self-Attention Mechanism