
Object-oriented Two-Stream Network And Heterogeneous Graph Reasoning On Video Question Answering

Posted on: 2024-04-22    Degree: Master    Type: Thesis
Country: China    Candidate: X Zhang    Full Text: PDF
GTID: 2568307103975319    Subject: Computer technology
Abstract/Summary:
Given a video clip and a question described in natural language, the video question answering (Video QA) task aims to automatically generate a grammatically and semantically well-formed natural language answer through fine-grained understanding of and reasoning over both the video content and the question. The challenges of the current video question answering task lie mainly in two aspects: (1) Most existing video question answering methods encode the video with frame-level appearance and motion features, which ignores the modeling of fine-grained object features and the spatiotemporal interactions between objects. In addition, the object detectors used in video question answering cannot model temporal information, so the actions of the extracted objects are left unanalyzed. (2) Current methods usually leverage attention mechanisms to uncover latent correlations between video content and question semantics. Although these methods exploit the interaction between different modalities to improve reasoning ability, the reasoning over inter- and intra-modality interactions cannot be effectively integrated into a unified model. To address these two challenges, this thesis conducts the following research.

To address the first problem, this thesis proposes a video question answering method based on an object-oriented two-stream attention network. First, a location-aware and background-aware object-oriented two-stream video feature representation is proposed, which resolves the lack of fine-grained object features and action analysis; the two streams refer to static appearance features and dynamic motion features. Second, on top of the object-oriented two-stream architecture, a question-guided attention module is proposed to explore inter- and intra-modal interactions at a fine granularity and to generate question-conditioned, reconstructed two-stream representations of the video.

To address the second problem, this thesis proposes a video question answering method based on co-attention and heterogeneous graph reasoning. Building on the object-oriented two-stream architecture, this method introduces a co-attention unit and a heterogeneous graph reasoning unit. The co-attention unit uses an attention mechanism to transform appearance, motion, and question features into a common semantic space, thereby achieving cross-modal fusion of the three modalities. The heterogeneous graph reasoning unit models the appearance, motion, and question features by building a heterogeneous graph structure, so that reasoning over inter- and intra-modal interactions is completed within a unified graph structure.

The proposed methods are evaluated on the large-scale video question answering datasets TGIF-QA, MSVD-QA, and MSRVTT-QA. On the TGIF-QA dataset, question answering accuracy on the action and transition tasks increases by 5.6 and 4 percentage points, respectively, over state-of-the-art methods, fully demonstrating the effectiveness of the proposed methods.
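To illustrate the cross-modal fusion idea described above, the following is a minimal PyTorch sketch of a question-guided co-attention module that projects appearance, motion, and question features into a common semantic space and attends over the video features under the guidance of the question. The module name, dimensions, and layer choices are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttentionFusion(nn.Module):
    """Sketch: project appearance, motion, and question features into a shared
    semantic space and fuse them with question-guided attention (assumed design)."""
    def __init__(self, d_app, d_mot, d_q, d_common=512):
        super().__init__()
        self.proj_app = nn.Linear(d_app, d_common)
        self.proj_mot = nn.Linear(d_mot, d_common)
        self.proj_q = nn.Linear(d_q, d_common)

    def attend(self, query, context):
        # query: (B, d), context: (B, N, d) -> attention-weighted context summary
        scores = torch.bmm(context, query.unsqueeze(-1)).squeeze(-1)      # (B, N)
        weights = F.softmax(scores / context.size(-1) ** 0.5, dim=-1)
        return torch.bmm(weights.unsqueeze(1), context).squeeze(1)        # (B, d)

    def forward(self, app_feats, mot_feats, q_feat):
        # app_feats: (B, N_obj, d_app), mot_feats: (B, N_obj, d_mot), q_feat: (B, d_q)
        app = torch.tanh(self.proj_app(app_feats))   # common semantic space
        mot = torch.tanh(self.proj_mot(mot_feats))
        q = torch.tanh(self.proj_q(q_feat))
        app_ctx = self.attend(q, app)                # question-guided appearance summary
        mot_ctx = self.attend(q, mot)                # question-guided motion summary
        return torch.cat([app_ctx, mot_ctx, q], dim=-1)

# Toy usage with random tensors (hypothetical feature sizes)
fusion = CoAttentionFusion(d_app=2048, d_mot=1024, d_q=768)
out = fusion(torch.randn(2, 16, 2048), torch.randn(2, 16, 1024), torch.randn(2, 768))
print(out.shape)  # torch.Size([2, 1536])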
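Similarly, the sketch below shows one way a heterogeneous graph reasoning unit could operate over appearance, motion, and question nodes with graph-convolution-style message passing. The fully connected similarity-based adjacency, the number of layers, and the pooling are assumptions made for illustration; the thesis's actual graph construction may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeteroGraphReasoning(nn.Module):
    """Sketch: reason over a heterogeneous graph whose nodes are appearance,
    motion, and question features already mapped to a common dimension d."""
    def __init__(self, d, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(d, d) for _ in range(num_layers)])

    def forward(self, app, mot, qtok):
        # app: (B, Na, d), mot: (B, Nm, d), qtok: (B, Nq, d)
        nodes = torch.cat([app, mot, qtok], dim=1)                         # (B, N, d)
        # Similarity-based adjacency over all nodes covers both intra-modal
        # (object-object) and inter-modal (object-word) edges in one graph.
        adj = F.softmax(torch.bmm(nodes, nodes.transpose(1, 2)), dim=-1)   # (B, N, N)
        for layer in self.layers:
            nodes = F.relu(layer(torch.bmm(adj, nodes)))                   # GCN-style update
        return nodes.mean(dim=1)                                           # pooled graph representation

# Toy usage: 8 appearance nodes, 8 motion nodes, 12 question tokens, d = 512
reasoner = HeteroGraphReasoning(d=512)
out = reasoner(torch.randn(2, 8, 512), torch.randn(2, 8, 512), torch.randn(2, 12, 512))
print(out.shape)  # torch.Size([2, 512])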
Keywords/Search Tags:Deep Learning, Video Question Answering, Cross-modal Fusion, Attention Mechanism, Graph Convolutional Network