
Multimodal Fine-grained Interaction Modeling For Textual Video Grounding

Posted on: 2021-02-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Song
Full Text: PDF
GTID: 2428330605482471
Subject: Computer technology
Abstract/Summary:
The task of video grounding is to localize the video segment that semantically corresponds to a given natural-language query. It is an interdisciplinary research problem spanning computer vision and natural language processing. Compared with single-modality tasks, it has more real-world applications and higher research value. The key to video grounding is to build interactions between visual and textual features and to capture the latent relations between visual and textual information. Meanwhile, the video and text modalities carry rich contextual information and are deeply related along the temporal dimension. However, most existing methods focus on building inter-modal interactions while neglecting fine-grained information and intra-modal interactions. In this thesis, we address video grounding through fine-grained modeling of multimodal information. Our contributions can be summarized as follows:

1. We present an Intra- and Inter-modal Multilinear pooling (IIM) model that effectively combines multimodal features by considering both intra- and inter-modal feature interactions (see the illustrative sketch after this abstract). Going further, we extend IIM to a generalized version, GIIM, which can take more than two input features. During training, we propose a simple yet effective multi-task learning framework that adds an action-recognition branch for regularization, and we further introduce two label-smoothing strategies. Experimental results on the TACoS and Charades-STA datasets demonstrate the superiority of the proposed approach over existing state-of-the-art methods.

2. We present a Multi-level intra- and inter-modal Attentional Reconstruction Network (MARN). The proposed method captures cross-modal attention while combining the multimodal features through inter-modal feature interactions. It relies only on video-sentence-level annotations during training and directly scores the candidate segments at test time. Moreover, a second branch that learns clip-level attention is exploited to refine the proposals during both training and testing. We also develop a novel proposal sampling mechanism that leverages intra-proposal information to learn better proposal representations. Experiments on the Charades-STA and ActivityNet Captions datasets demonstrate the superiority of our MARN over existing weakly-supervised methods.
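The snippet below is a minimal, illustrative sketch of the intra- and inter-modal fusion idea behind contribution 1; it is not the thesis implementation. The PyTorch module, the layer names, and the feature dimensions are assumptions made purely for illustration.

```python
# Hypothetical sketch of intra- and inter-modal feature fusion (not the thesis code).
# Dimensions (vis_dim, txt_dim, joint_dim) are illustrative assumptions.
import torch
import torch.nn as nn

class IntraInterPooling(nn.Module):
    def __init__(self, vis_dim=500, txt_dim=300, joint_dim=256):
        super().__init__()
        # project each modality into a shared joint space
        self.vis_proj = nn.Linear(vis_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, v, q):
        v_p = torch.tanh(self.vis_proj(v))   # (batch, joint_dim) video projection
        q_p = torch.tanh(self.txt_proj(q))   # (batch, joint_dim) query projection
        inter = v_p * q_p                    # inter-modal interaction (video x text)
        intra_v = v_p * v_p                  # intra-modal interaction within video
        intra_q = q_p * q_p                  # intra-modal interaction within text
        # fuse all interaction terms into one multimodal representation
        return inter + intra_v + intra_q

# Usage example: a batch of 4 video features and 4 query features
fused = IntraInterPooling()(torch.randn(4, 500), torch.randn(4, 300))  # shape (4, 256)
```

The sketch only shows the basic pattern of combining element-wise intra- and inter-modal interaction terms in a shared space; the actual IIM/GIIM models described above use multilinear pooling and can accept more than two input features.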
Keywords/Search Tags:Visual Grounding, Video Understanding, Cross-media, Weakly-supervised Learning, Multi-task Learning