
Weakly supervised learning from multiple modalities: Exploiting video, audio and text for video understanding

Posted on: 2010-02-10    Degree: Ph.D.    Type: Dissertation
University: University of Pennsylvania    Candidate: Timothee Cour    Full Text: PDF
GTID: 1448390002976113    Subject: Artificial Intelligence
Abstract/Summary:
As web and personal content become ever more enriched by video, there is an increasing need for semantic video search and indexing. A main challenge for this task is the lack of supervised data for learning models. In this dissertation we propose weakly supervised algorithms for video content analysis, focusing on recovering video structure, retrieving actions and identifying people. Key components of the algorithms we present are (1) alignment between multiple modalities (video, audio and text), and (2) a unified convex formulation for learning under weak supervision from easily accessible data.

At a coarse level, we focus on the task of recovering scene structure in movies and TV series. We present a weakly supervised algorithm that parses a movie into a hierarchy of scenes, threads and shots: movie scene boundaries are aligned with screenplay scenes, and shots are reordered into threads. We present a unified generative model and a novel hierarchical dynamic-programming inference procedure.

At a finer level, we aim to resolve person identity in video using images, the screenplay and closed captions. We consider a partially supervised multiclass classification setting in which each instance is labeled ambiguously with more than one label: the set of potential labels for each face consists of the characters' names mentioned in the corresponding screenplay scene. We propose a novel convex formulation based on minimization of a surrogate loss, and provide theoretical analysis and strong empirical evidence that effective learning is possible even when all examples are ambiguously labeled.

We also investigate the challenging scenario of naming people in video without a screenplay. Our only source of (indirect) supervision consists of person references mentioned in dialog, such as "Hey, Jack!". We resolve identities by learning a classifier from partial label constraints, incorporating multiple-instance constraints from dialog, gender cues and local grouping constraints in a unified convex learning formulation.
Grouping constraints are provided by a novel temporal grouping model that integrates appearance, synchrony and film-editing cues to partition faces across multiple shots. We present dynamic-programming inference and discriminative learning for this partitioning model.

We have deployed our framework on hundreds of hours of movies and TV, and present quantitative and qualitative results for each component.
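To make the ambiguously-labeled classification setting concrete, here is a minimal sketch, not the dissertation's exact formulation: it assumes a linear classifier and a squared-hinge surrogate, rewards the average score over each example's candidate labels, and penalizes positive scores on non-candidate labels. All names and the toy data are illustrative.

```python
import numpy as np

def partial_label_loss(W, X, candidate_sets, n_classes):
    """Convex surrogate loss for ambiguously labeled examples (sketch).

    W: (n_classes, d) linear classifier weights
    X: (n, d) feature matrix
    candidate_sets: per-example sets of candidate label indices
    """
    def psi(z):
        # Squared hinge: convex, nonnegative, zero for margins >= 1.
        return np.maximum(0.0, 1.0 - z) ** 2

    scores = X @ W.T  # (n, n_classes) classifier scores
    total = 0.0
    for i, cands in enumerate(candidate_sets):
        cands = sorted(cands)
        others = [k for k in range(n_classes) if k not in cands]
        # Encourage a large *average* score over the candidate labels...
        total += psi(scores[i, cands].mean())
        # ...and penalize any positive score on non-candidate labels.
        total += psi(-scores[i, others]).sum()
    return total / len(candidate_sets)
```

The averaging over candidate labels is what keeps the objective convex in W while still exploiting the ambiguous supervision: the loss is zero whenever some candidate label scores high and every non-candidate label scores low.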
Keywords/Search Tags: Video, Weakly supervised, Present, Multiple