
Jointly Cross- and Self-Modal Graph Attention Networks for Query-Based Moment Retrieval in Videos

Posted on: 2021-05-13  Degree: Master  Type: Thesis
Country: China  Candidate: X Y Qu  Full Text: PDF
GTID: 2518306104486714  Subject: Information and Communication Engineering
Abstract/Summary:
Localizing actions in a video is an important means of understanding video content. However, actions in video often carry rich semantic content and complex background knowledge, and cannot be fully covered by a predefined set of action classes. To address this problem, Query-based Moment Retrieval in Videos (QMRV) has been proposed. QMRV studies the following task: given a natural language description as a query sentence, locate the video segment corresponding to that description in an untrimmed long video, that is, determine the start and end points of the segment. As a newly emerging field, QMRV is attracting growing attention from researchers in many areas because of its wide applications in video understanding and human-computer interaction. In recent years, a series of methods based on sliding-window matching or cross-modal attention mechanisms have been proposed and have made great progress on QMRV, but the field remains very challenging, and the localization results of the latest methods are still not accurate enough. Moreover, as a cross-modal task, precise segment localization places high demands on modality encoding and on thorough cross-modal interaction.

This thesis studies a series of problems that QMRV currently needs to solve and proposes a Jointly Cross- and Self-Modal Graph Attention Network (CSMGAN). First, in contrast to previous work that encodes the query sentence with a recurrent neural network alone, we propose a Hierarchical Sentence Encoder to extract the semantic features of the query sentence. The main idea is to capture and fuse the information of the query at multiple levels, namely words, phrases, and the whole sentence, so as to capture more accurately the textual cues that help localize the video segment. Second, traditional work tends to focus on the interaction between modalities while neglecting the importance of intra-modal information for localization. Yet intra-modal information plays a large role in associating elements that belong to the same action within a sequence and in distinguishing elements of different actions, which is very helpful for accurate localization. We therefore propose a Jointly Cross- and Self-Modal Graph that simultaneously considers cross-modal interaction and relation modeling within each modality. The main idea is to use cross-modal interaction to correlate information across the two modalities, and then to use self-modal relation modeling to correlate related elements within each modality; by stacking multiple layers of the joint attention graph, high-order interactions between the two modalities are realized.

Finally, we conduct experiments on three datasets, ActivityNet Captions, TACoS, and Charades-STA, to verify the effectiveness of our network. The experimental results show that the proposed CSMGAN achieves significantly better results than the best existing methods on all three datasets. In addition, we carry out extensive ablation and visualization experiments to study the components of our network in depth.
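To make the "cross-modal interaction followed by self-modal relation modeling" idea concrete, the following is a minimal PyTorch sketch of one such joint layer. It is not the thesis implementation: the class name JointCrossSelfModalLayer, the projection layers, the scaled dot-product form of attention, and the toy feature sizes are all assumptions for illustration; the only premise taken from the abstract is that video and query features live in a shared dimension and are processed by cross-modal attention first and then by a fully connected self-modal graph.

import torch
import torch.nn as nn


class JointCrossSelfModalLayer(nn.Module):
    """Illustrative joint cross- and self-modal attention layer (hypothetical names)."""

    def __init__(self, d: int):
        super().__init__()
        self.q_proj = nn.Linear(d, d)     # query-side projection for cross-modal attention
        self.k_proj = nn.Linear(d, d)     # key-side projection
        self.v_proj = nn.Linear(d, d)     # value-side projection
        self.self_proj = nn.Linear(d, d)  # projection for self-modal graph attention

    def cross_attend(self, x, y):
        # Each element of modality x attends over all elements of modality y
        # (cross-modal interaction), with a residual connection.
        scores = self.q_proj(x) @ self.k_proj(y).transpose(-1, -2) / x.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)
        return x + attn @ self.v_proj(y)

    def self_attend(self, x):
        # Fully connected graph over elements of one modality: soft adjacency from
        # pairwise similarity, used to relate elements of the same action and
        # separate elements of different actions (self-modal relation modeling).
        h = self.self_proj(x)
        adj = torch.softmax(h @ h.transpose(-1, -2) / x.size(-1) ** 0.5, dim=-1)
        return x + adj @ h

    def forward(self, video, query):
        # Cross-modal interaction first, then self-modal relation modeling;
        # stacking several such layers yields higher-order interactions.
        video, query = self.cross_attend(video, query), self.cross_attend(query, video)
        return self.self_attend(video), self.self_attend(query)


# Usage with random features: 128 video clips and 12 query words, dimension 256.
video = torch.randn(1, 128, 256)
query = torch.randn(1, 12, 256)
layer = JointCrossSelfModalLayer(256)
video_out, query_out = layer(video, query)
print(video_out.shape, query_out.shape)  # torch.Size([1, 128, 256]) torch.Size([1, 12, 256])

In this sketch the two directions of cross-modal attention and the self-modal graph are symmetric for video and query; the actual network may treat the two modalities differently and adds a localization head on top, which is omitted here.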
Keywords/Search Tags: Cross-modal Understanding, Video Segment Localization, Graph Attention Network, Natural Language Processing