With the continuous development of the multimedia internet, multimodal data that combines vision and language has gradually become the mainstream information medium and plays an important role in daily life. In single-modality fields such as natural language processing and computer vision, researchers have made great progress. However, how to bridge the semantic gap between modalities and reason over cross-modal information has become an active research topic. As a typical cross-modal task, video question answering (VideoQA) has attracted extensive attention since it was proposed. VideoQA plays an important role in video description, video sentiment analysis, and other related tasks, and it also has great commercial and social value in smart healthcare, smart education, and information retrieval. At present, most existing VideoQA methods focus mainly on visual reasoning and pay less attention to understanding the temporal and multimodal semantic information in videos. To this end, the main research contributions of this work can be summarized as follows:

(1) This paper proposes an attention-based video feature aggregation model named AttentionVLAD. First, a customized locally aggregated descriptors algorithm is proposed to model the residual distribution and the quality of features. Then a multimodal feature fusion module is introduced to compute the correlations among different modalities for joint optimization. Comparative experiments and ablation studies demonstrate the effectiveness of the model.

(2) This paper proposes a hierarchical attention model for VideoQA named RHA. RHA builds two relation graphs to capture the spatial and semantic relationships in videos, and entity features are updated through a graph attention network. A hierarchical attention mechanism is then proposed to fuse multimodal features at the spatial, temporal, and modality levels. The experimental results show that the proposed method outperforms baseline methods.

(3) This paper proposes a knowledge-enhanced VideoQA model named KEVQA. For the input video and knowledge, a semantic relation graph between visual entities and an implicit relation graph between textual blocks are constructed, respectively, and the relations are embedded into the nodes through a graph encoder. KEVQA then introduces a self-supervised cross-modal learning module to preserve consistency and complementarity across modalities. Experiments on three large-scale VideoQA datasets prove the effectiveness of the method.
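
To make the aggregation step in contribution (1) concrete, the following is a minimal sketch of attention-weighted, VLAD-style aggregation of frame features in PyTorch; the class name, dimensions, and the sigmoid frame-attention gate are illustrative assumptions, not the thesis's exact implementation.

```python
# Sketch only: attention-weighted VLAD-style aggregation of frame features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionVLADSketch(nn.Module):
    def __init__(self, feat_dim: int, num_clusters: int):
        super().__init__()
        # Soft-assignment of each frame feature to learnable cluster centers.
        self.assign = nn.Linear(feat_dim, num_clusters)
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))
        # Frame-level attention that scores the quality of each feature (assumed gate).
        self.frame_attn = nn.Linear(feat_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, feat_dim)
        assign = F.softmax(self.assign(x), dim=-1)            # (B, N, K)
        attn = torch.sigmoid(self.frame_attn(x))              # (B, N, 1)
        assign = assign * attn                                 # attention-weighted assignment
        # Residuals between each frame feature and each cluster center.
        residual = x.unsqueeze(2) - self.centers               # (B, N, K, D)
        vlad = (assign.unsqueeze(-1) * residual).sum(dim=1)    # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                       # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)            # (B, K*D) clip descriptor


# Usage: aggregate 32 frame features of dimension 512 into one clip descriptor.
frames = torch.randn(2, 32, 512)
model = AttentionVLADSketch(feat_dim=512, num_clusters=8)
print(model(frames).shape)  # torch.Size([2, 4096])
```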
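
For contribution (2), the sketch below illustrates question-guided hierarchical attention fusion at the temporal and modality levels, again in PyTorch; the module structure, feature names, and the use of subtitle features as the second modality are assumptions for illustration rather than RHA's exact architecture.

```python
# Sketch only: question-guided attention pooling applied hierarchically,
# first over time within each modality, then over the pooled modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalFusionSketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.temporal_score = nn.Linear(dim * 2, 1)  # scores each time step against the question
        self.modality_score = nn.Linear(dim * 2, 1)  # scores each modality against the question

    def attend(self, feats, question, scorer):
        # feats: (B, M, D); question: (B, D)
        q = question.unsqueeze(1).expand_as(feats)
        logits = scorer(torch.cat([feats, q], dim=-1))         # (B, M, 1)
        weights = F.softmax(logits, dim=1)
        return (weights * feats).sum(dim=1)                    # (B, D)

    def forward(self, visual_seq, subtitle_seq, question):
        # Temporal level: pool each modality's sequence with question-guided attention.
        v = self.attend(visual_seq, question, self.temporal_score)
        s = self.attend(subtitle_seq, question, self.temporal_score)
        # Modality level: weight the pooled visual and textual representations.
        modal = torch.stack([v, s], dim=1)                     # (B, 2, D)
        return self.attend(modal, question, self.modality_score)
```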
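
For the self-supervised cross-modal learning module in contribution (3), one common way to enforce consistency between paired representations is an InfoNCE-style contrastive objective; the snippet below is a generic sketch of that idea and stands in for KEVQA's module only as an illustration, not its exact formulation.

```python
# Sketch only: symmetric contrastive loss aligning paired video/knowledge embeddings.
import torch
import torch.nn.functional as F


def cross_modal_consistency_loss(visual_emb, text_emb, temperature: float = 0.07):
    # visual_emb, text_emb: (B, D) paired representations from the two modalities.
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Matched pairs lie on the diagonal; align both directions symmetrically.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```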