Given a video and a natural language question about it, the video question answering (Video QA) task outputs the answer to the question. Video QA combines computer vision and natural language processing, is closely related to general artificial intelligence, and has high research value and broad application prospects. The structural information formed by the relationships between the objects in a video is highly complex, which restricts understanding and reasoning, so feature fusion and interaction are the keys to video question answering. Benefiting from the ability of graph neural networks to represent structural information, cross-modal interactions can be modeled as graph neural network updates. With the development of multimodal fusion technology, Transformer-based visual-text reasoning models have also become mainstream methods owing to their excellent performance. To address these bottlenecks of the video question answering task, this thesis proposes the following three algorithms from different perspectives:

(1) A multi-scale progressive attention network for video question answering. First, clips of different lengths are constructed from the frame sequence, and the clip length serves as the scale information. Multi-scale graphs are then generated for the clips of each scale, with vertices representing video features. To enable relational reasoning, graph convolutions update the vertices within each scale graph. Guided by the question, progressive attention fuses multi-scale features during cross-scale graph interaction, as sketched below: each graph is gradually updated in top-down scale order, and then each graph is updated in bottom-up scale order. Finally, the vertex features are fused with the question embedding and a classifier predicts the answer. The model outperforms state-of-the-art methods on the TGIF, MSVD, and MSRVTT video question answering benchmarks.

(2) A universal quaternion hypergraph network for multimodal video question answering. First, features of the multi-source information (video, subtitles, question, and candidate answers) are extracted by pre-trained ResNet-152, SlowFast, and BERT, respectively. Second, the extracted features are embedded into the quaternion space to represent the multimodal information of the video. Next, a hypergraph is constructed from the visual objects detected in the video, with vertices representing clip-level quaternion features. Multimodal and structural inference is then performed by a quaternion hypergraph convolutional network. Finally, the proposed question answering inference module performs span proposal and answer prediction to select the correct answer from the candidates. Evaluated on the multimodal video question answering datasets TVQA and DramaQA, the proposed algorithm outperforms state-of-the-art methods.
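To make the graph-based reasoning of the first model concrete, the following PyTorch sketch shows one possible form of the scale-wise update: a question-guided graph convolution within each scale graph, followed by progressive cross-scale fusion in a top-down and then a bottom-up pass. The gating, the similarity-based adjacency, and the adjacent-scale cross-attention are illustrative assumptions of this sketch, not necessarily the thesis' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedGraphConv(nn.Module):
    """One graph-convolution step over the clip vertices of a single scale graph.
    Edge weights come from question-gated vertex similarity (an assumption of
    this sketch, not the thesis' exact formulation)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, vertices, question):
        # vertices: (num_clips, dim); question: (dim,)
        gated = vertices * torch.sigmoid(self.gate(question))           # question-guided gating
        adj = F.softmax(gated @ gated.t() / vertices.size(-1) ** 0.5, dim=-1)
        return F.relu(self.proj(adj @ vertices)) + vertices             # residual vertex update


class MultiScaleProgressiveAttention(nn.Module):
    """Per-scale graph convolution followed by progressive cross-scale fusion:
    a top-down pass (coarse to fine) and a bottom-up pass (fine to coarse),
    each scale attending to its neighbouring scale."""
    def __init__(self, dim, num_scales, num_heads=4):
        super().__init__()
        self.convs = nn.ModuleList([QuestionGuidedGraphConv(dim) for _ in range(num_scales)])
        self.cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, scale_graphs, question):
        # scale_graphs: list of vertex tensors (num_clips_s, dim), coarsest scale first
        graphs = [conv(g, question) for conv, g in zip(self.convs, scale_graphs)]
        for s in range(1, len(graphs)):                                  # top-down pass
            ctx = graphs[s - 1].unsqueeze(0)
            fused, _ = self.cross(graphs[s].unsqueeze(0), ctx, ctx)
            graphs[s] = graphs[s] + fused.squeeze(0)
        for s in range(len(graphs) - 2, -1, -1):                         # bottom-up pass
            ctx = graphs[s + 1].unsqueeze(0)
            fused, _ = self.cross(graphs[s].unsqueeze(0), ctx, ctx)
            graphs[s] = graphs[s] + fused.squeeze(0)
        return graphs  # later fused with the question embedding and classified
```

For the second model, the core operation is a hypergraph convolution over clip-level quaternion vertex features. The sketch below combines a standard degree-normalised hypergraph aggregation with a Hamilton-product (quaternion) linear transform; the binary incidence matrix, the weight initialisation, and the residual connection are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

def quaternion_linear(x, w_r, w_i, w_j, w_k):
    """Quaternion linear map via the Hamilton product; x stores the four real
    components (r, i, j, k) concatenated along the last dimension."""
    r, i, j, k = torch.chunk(x, 4, dim=-1)
    return torch.cat([
        r @ w_r - i @ w_i - j @ w_j - k @ w_k,   # real part
        r @ w_i + i @ w_r + j @ w_k - k @ w_j,   # i part
        r @ w_j - i @ w_k + j @ w_r + k @ w_i,   # j part
        r @ w_k + i @ w_j - j @ w_i + k @ w_r,   # k part
    ], dim=-1)

class QuaternionHypergraphConv(nn.Module):
    """One hypergraph convolution: vertices are aggregated into hyperedges and
    scattered back with degree normalisation, then transformed by a
    Hamilton-product weight."""
    def __init__(self, dim):
        super().__init__()
        assert dim % 4 == 0, "quaternion features need four real components"
        q = dim // 4
        self.w_r = nn.Parameter(torch.randn(q, q) * 0.02)
        self.w_i = nn.Parameter(torch.randn(q, q) * 0.02)
        self.w_j = nn.Parameter(torch.randn(q, q) * 0.02)
        self.w_k = nn.Parameter(torch.randn(q, q) * 0.02)

    def forward(self, x, incidence):
        # x: (num_vertices, dim); incidence: (num_vertices, num_hyperedges) with 0/1 entries
        dv = incidence.sum(dim=1).clamp(min=1.0)                 # vertex degrees
        de = incidence.sum(dim=0).clamp(min=1.0)                 # hyperedge degrees
        h = x / dv.sqrt().unsqueeze(-1)
        edges = (incidence.t() @ h) / de.unsqueeze(-1)           # vertices -> hyperedges
        h = (incidence @ edges) / dv.sqrt().unsqueeze(-1)        # hyperedges -> vertices
        return torch.relu(quaternion_linear(h, self.w_r, self.w_i, self.w_j, self.w_k)) + x
```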
(3) A Transformer model based on hypergraph inference for video question answering. First, text features of the question are extracted through word vector embedding. Second, pre-trained Faster R-CNN, 3D ResNet-152, and ResNet-50 are used to extract detection features, motion features, and grid features, respectively, and the spatio-temporal relationships in the video are characterized by learning a hypergraph neural network. Token embedding, segment embedding, and position embedding are then computed for these four types of features, the three embeddings are summed for each token, and all tokens are concatenated into a single sequence. The token features are fed into the encoder and decoder, where multimodal fusion and interaction are achieved through the self-attention mechanism. Finally, the pre-training stage makes predictions on the [MASK] tokens, and the fine-tuning stage trains the answer classifier on the [CLS] token. After pre-training on the VATEX dataset, the proposed algorithm, fine-tuned on the video question answering benchmark datasets, surpasses the accuracy of current state-of-the-art methods.
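The token construction and the two training heads of the third model can be sketched as follows. The hidden size of 768, the module names (MultimodalTokenEmbedding, AnswerHeads), and the assumption that the [CLS] token sits at position 0 are hypothetical choices for illustration; the encoder and decoder themselves are standard Transformer stacks and are omitted.

```python
import torch
import torch.nn as nn

class MultimodalTokenEmbedding(nn.Module):
    """Builds one token sequence from the four feature types (question text,
    detection, motion, grid): each token is the sum of its projected feature
    (token embedding), a segment embedding identifying the feature type, and a
    position embedding; all tokens are then concatenated."""
    def __init__(self, feature_dims, hidden=768, max_len=512):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in feature_dims])
        self.segment = nn.Embedding(len(feature_dims), hidden)
        self.position = nn.Embedding(max_len, hidden)

    def forward(self, features):
        # features: list of tensors (len_s, dim_s) in the order question, detection, motion, grid;
        # the total length must stay below max_len
        tokens, offset = [], 0
        for seg_id, (proj, x) in enumerate(zip(self.proj, features)):
            n = x.size(0)
            pos = torch.arange(offset, offset + n, device=x.device)
            seg = torch.full((n,), seg_id, dtype=torch.long, device=x.device)
            tokens.append(proj(x) + self.segment(seg) + self.position(pos))
            offset += n
        return torch.cat(tokens, dim=0)          # concatenated multimodal token sequence

class AnswerHeads(nn.Module):
    """Hypothetical heads on top of the encoder output: a masked-token head for
    pre-training on [MASK] positions and an answer classifier on the [CLS] token
    for fine-tuning."""
    def __init__(self, hidden, vocab_size, num_answers):
        super().__init__()
        self.mask_head = nn.Linear(hidden, vocab_size)    # pre-training: predict masked tokens
        self.cls_head = nn.Linear(hidden, num_answers)    # fine-tuning: classify the answer

    def forward(self, encoded, mask_positions=None):
        # encoded: (seq_len, hidden); token 0 is assumed to be [CLS]
        if mask_positions is not None:
            return self.mask_head(encoded[mask_positions])
        return self.cls_head(encoded[0])
```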