
Weakly supervised learning from multiple modalities: Exploiting video, audio and text for video understanding

Posted on: 2010-02-10    Degree: Ph.D.    Type: Dissertation
University: University of Pennsylvania    Candidate: Timothee Cour    Full Text: PDF
GTID: 1448390002976113    Subject: Artificial Intelligence
Abstract/Summary:
As web and personal content become ever more enriched by video, there is an increasing need for semantic video search and indexing. A main challenge for this task is the lack of supervised data for learning models. In this dissertation we propose weakly supervised algorithms for video content analysis, focusing on recovering video structure, retrieving actions and identifying people. Key components of the algorithms we present are (1) alignment between multiple modalities (video, audio and text), and (2) a unified convex formulation for learning under weak supervision from easily accessible data.

At a coarse level, we focus on the task of recovering scene structure in movies and TV series. We present a weakly supervised algorithm that parses a movie into a hierarchy of scenes, threads and shots: movie scene boundaries are aligned with screenplay scenes, and shots are reordered into threads. We present a unified generative model and a novel hierarchical dynamic-programming inference procedure.

At a finer level, we aim to resolve person identity in video using images, the screenplay and closed captions. We consider a partially supervised multiclass classification setting in which each instance is labeled ambiguously with more than one label: the set of potential labels for each face consists of the characters' names mentioned in the corresponding screenplay scene. We propose a novel convex formulation based on minimization of a surrogate loss, and provide theoretical analysis and strong empirical evidence that effective learning is possible even when all examples are ambiguously labeled.

We also investigate the challenging scenario of naming people in video without a screenplay. Our only source of (indirect) supervision consists of person references mentioned in dialog, such as "Hey, Jack!". We resolve identities by learning a classifier from partial label constraints, incorporating multiple-instance constraints from dialog, gender cues and local grouping constraints in a unified convex learning formulation.
Grouping constraints are provided by a novel temporal grouping model that integrates appearance, synchrony and film-editing cues to partition faces across multiple shots. We present dynamic-programming inference and discriminative learning for this partitioning model.

We have deployed our framework on hundreds of hours of movies and TV, and present quantitative and qualitative results for each component.
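To make the ambiguously-labeled classification setting concrete, here is a minimal sketch, not the dissertation's exact formulation: it assumes a linear classifier and a squared-hinge surrogate, rewards the average score over each example's candidate labels, and penalizes positive scores on non-candidate labels. All names and the toy data are illustrative.

```python
import numpy as np

def partial_label_loss(W, X, candidate_sets, n_classes):
    """Convex surrogate loss for ambiguously labeled examples (sketch).

    W: (n_classes, d) linear classifier weights
    X: (n, d) feature matrix
    candidate_sets: per-example sets of candidate label indices
    """
    def psi(z):
        # Squared hinge: convex, nonnegative, zero for margins >= 1.
        return np.maximum(0.0, 1.0 - z) ** 2

    scores = X @ W.T  # (n, n_classes) classifier scores
    total = 0.0
    for i, cands in enumerate(candidate_sets):
        cands = sorted(cands)
        others = [k for k in range(n_classes) if k not in cands]
        # Encourage a large *average* score over the candidate labels...
        total += psi(scores[i, cands].mean())
        # ...and penalize any positive score on non-candidate labels.
        total += psi(-scores[i, others]).sum()
    return total / len(candidate_sets)
```

The averaging over candidate labels is what keeps the objective convex in W while still exploiting the ambiguous supervision: the loss is zero whenever some candidate label scores high and every non-candidate label scores low.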
Keywords/Search Tags: Video, Weakly supervised, Present, Multiple