With the continuous development of the multimedia internet, multimodal data that combines vision and language has gradually become the mainstream information medium and plays an important role in daily life. In single-modality fields such as natural language processing and computer vision, researchers have made great progress. However, how to bridge the semantic gap between modalities and reason over cross-modal information has become an active research topic. As a typical cross-modal task, video question answering (VideoQA) has attracted extensive attention since it was proposed. VideoQA plays an important role in video description, video sentiment analysis, and other related tasks, and it also has great commercial and social value in smart healthcare, smart education, and information retrieval. At present, most existing VideoQA methods focus mainly on visual reasoning and pay less attention to understanding the temporal and multimodal semantic information in videos. To this end, the main research contributions of this work can be summarized as follows:

(1) This paper proposes an attention-based video feature aggregation model named AttentionVLAD. First, a customized locally aggregated descriptors algorithm is proposed to model the residual distribution and the quality of features. Then a multimodal feature fusion module is introduced to compute the correlations among different modalities for joint optimization. Comparative experiments and ablation studies demonstrate the effectiveness of the model.

(2) This paper proposes a hierarchical attention model for VideoQA named RHA. RHA builds two relation graphs to capture the spatial and semantic relationships in videos, and entity features are updated through a graph attention network. A hierarchical attention mechanism is then proposed to fuse multimodal features at the spatial, temporal, and modality levels. The experimental results show that the proposed method outperforms baseline methods.

(3) This paper proposes a knowledge-enhanced VideoQA model named KEVQA. For the input video and knowledge, a semantic relation graph between visual entities and an implicit relation graph between textual blocks are constructed, respectively, and the relations are embedded into the nodes through a graph encoder. KEVQA then introduces a self-supervised cross-modal learning module to preserve consistency and complementarity across modalities. Experiments on three large-scale VideoQA datasets prove the effectiveness of the method.
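
To make the aggregation step in contribution (1) concrete, the following is a minimal sketch of attention-weighted, VLAD-style aggregation of frame features in PyTorch; the class name, dimensions, and the sigmoid frame-attention gate are illustrative assumptions, not the thesis's exact implementation.

```python
# Sketch only: attention-weighted VLAD-style aggregation of frame features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionVLADSketch(nn.Module):
    def __init__(self, feat_dim: int, num_clusters: int):
        super().__init__()
        # Soft-assignment of each frame feature to learnable cluster centers.
        self.assign = nn.Linear(feat_dim, num_clusters)
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))
        # Frame-level attention that scores the quality of each feature (assumed gate).
        self.frame_attn = nn.Linear(feat_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, feat_dim)
        assign = F.softmax(self.assign(x), dim=-1)            # (B, N, K)
        attn = torch.sigmoid(self.frame_attn(x))              # (B, N, 1)
        assign = assign * attn                                 # attention-weighted assignment
        # Residuals between each frame feature and each cluster center.
        residual = x.unsqueeze(2) - self.centers               # (B, N, K, D)
        vlad = (assign.unsqueeze(-1) * residual).sum(dim=1)    # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                       # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)            # (B, K*D) clip descriptor


# Usage: aggregate 32 frame features of dimension 512 into one clip descriptor.
frames = torch.randn(2, 32, 512)
model = AttentionVLADSketch(feat_dim=512, num_clusters=8)
print(model(frames).shape)  # torch.Size([2, 4096])
```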
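
For contribution (2), the sketch below illustrates question-guided hierarchical attention fusion at the temporal and modality levels, again in PyTorch; the module structure, feature names, and the use of subtitle features as the second modality are assumptions for illustration rather than RHA's exact architecture.

```python
# Sketch only: question-guided attention pooling applied hierarchically,
# first over time within each modality, then over the pooled modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalFusionSketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.temporal_score = nn.Linear(dim * 2, 1)  # scores each time step against the question
        self.modality_score = nn.Linear(dim * 2, 1)  # scores each modality against the question

    def attend(self, feats, question, scorer):
        # feats: (B, M, D); question: (B, D)
        q = question.unsqueeze(1).expand_as(feats)
        logits = scorer(torch.cat([feats, q], dim=-1))         # (B, M, 1)
        weights = F.softmax(logits, dim=1)
        return (weights * feats).sum(dim=1)                    # (B, D)

    def forward(self, visual_seq, subtitle_seq, question):
        # Temporal level: pool each modality's sequence with question-guided attention.
        v = self.attend(visual_seq, question, self.temporal_score)
        s = self.attend(subtitle_seq, question, self.temporal_score)
        # Modality level: weight the pooled visual and textual representations.
        modal = torch.stack([v, s], dim=1)                     # (B, 2, D)
        return self.attend(modal, question, self.modality_score)
```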
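
For the self-supervised cross-modal learning module in contribution (3), one common way to enforce consistency between paired representations is an InfoNCE-style contrastive objective; the snippet below is a generic sketch of that idea and stands in for KEVQA's module only as an illustration, not its exact formulation.

```python
# Sketch only: symmetric contrastive loss aligning paired video/knowledge embeddings.
import torch
import torch.nn.functional as F


def cross_modal_consistency_loss(visual_emb, text_emb, temperature: float = 0.07):
    # visual_emb, text_emb: (B, D) paired representations from the two modalities.
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Matched pairs lie on the diagonal; align both directions symmetrically.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```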