
Multimodal Fine-grained Interaction Modeling For Textual Video Grounding

Posted on: 2021-02-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Song
Full Text: PDF
GTID: 2428330605482471
Subject: Computer technology
Abstract/Summary:
The task of video grounding is to localize the video segment that semantically corresponds to a given natural-language query. It is an interdisciplinary research problem spanning computer vision and natural language processing. Compared with single-modality tasks, it has more real-world applications and higher research value. The key to video grounding is to build interactions between visual and textual features and to capture the latent relations between visual and textual information. Meanwhile, the video and text modalities carry rich contextual information and are deeply related along the temporal dimension. However, most existing methods focus on building inter-modal interactions while neglecting fine-grained information and intra-modal interactions. In this thesis, we address video grounding through fine-grained modeling of multimodal information. Our contributions can be summarized as follows:

1. We present an Intra- and Inter-modal Multilinear pooling (IIM) model that effectively combines multimodal features by considering both intra- and inter-modal feature interactions (see the illustrative sketch after this abstract). Going further, we extend IIM to a generalized version, GIIM, which can take more than two input features. During training, we propose a simple yet effective multi-task learning framework that adds an action-recognition branch for regularization, and we further introduce two label-smoothing strategies. Experimental results on the TACoS and Charades-STA datasets demonstrate the superiority of the proposed approach over existing state-of-the-art methods.

2. We present a Multi-level intra- and inter-modal Attentional Reconstruction Network (MARN). The proposed method captures cross-modal attention while combining the multimodal features through inter-modal feature interactions. It relies only on video-sentence-level annotations during training and directly scores the candidate segments at test time. Moreover, a second branch that learns clip-level attention is exploited to refine the proposals during both training and testing. We also develop a novel proposal sampling mechanism that leverages intra-proposal information to learn better proposal representations. Experiments on the Charades-STA and ActivityNet Captions datasets demonstrate the superiority of our MARN over existing weakly-supervised methods.
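The snippet below is a minimal, illustrative sketch of the intra- and inter-modal fusion idea behind contribution 1; it is not the thesis implementation. The PyTorch module, the layer names, and the feature dimensions are assumptions made purely for illustration.

```python
# Hypothetical sketch of intra- and inter-modal feature fusion (not the thesis code).
# Dimensions (vis_dim, txt_dim, joint_dim) are illustrative assumptions.
import torch
import torch.nn as nn

class IntraInterPooling(nn.Module):
    def __init__(self, vis_dim=500, txt_dim=300, joint_dim=256):
        super().__init__()
        # project each modality into a shared joint space
        self.vis_proj = nn.Linear(vis_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, v, q):
        v_p = torch.tanh(self.vis_proj(v))   # (batch, joint_dim) video projection
        q_p = torch.tanh(self.txt_proj(q))   # (batch, joint_dim) query projection
        inter = v_p * q_p                    # inter-modal interaction (video x text)
        intra_v = v_p * v_p                  # intra-modal interaction within video
        intra_q = q_p * q_p                  # intra-modal interaction within text
        # fuse all interaction terms into one multimodal representation
        return inter + intra_v + intra_q

# Usage example: a batch of 4 video features and 4 query features
fused = IntraInterPooling()(torch.randn(4, 500), torch.randn(4, 300))  # shape (4, 256)
```

The sketch only shows the basic pattern of combining element-wise intra- and inter-modal interaction terms in a shared space; the actual IIM/GIIM models described above use multilinear pooling and can accept more than two input features.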
Keywords/Search Tags:Visual Grounding, Video Understanding, Cross-media, Weakly-supervised Learning, Multi-task Learning