With the rapid development of deep learning and multimedia technologies, multi-modal and cross-modal learning, which focus on the relationships between different modalities, have become active research areas in recent years. Because the data structure of each modality differs, determining how to correlate different modalities is one of the primary challenges in this type of task. Audio and image, as the two most common modalities in daily life, offer a wealth of readily available data for model training and evaluation. This thesis therefore proposes two new tasks to further investigate the possibilities of the audio-image field: audio-image cross-modal zero-shot learning and audio-driven instrumental animation.

The audio-image cross-modal zero-shot learning task is motivated by the human ability to infer, from prior knowledge, the sound that an unseen object would emit. To investigate the viability of zero-shot inference between audio and image, this thesis derives three sub-tasks from traditional tasks: zero-shot cross-modal retrieval, zero-shot sound source localization, and audio-based zero-shot recognition. To address these three tasks, a unified model, LLINet (Look, Listen and Infer), is proposed, and a new dataset, INSTRUMENT-32 CLASS, is collected for model training and evaluation. For the retrieval task, this thesis presents a matching loss that exploits the contribution of negative instances within a batch during optimization. For the localization task, an attention module correlates a given audio clip with the image regions where the sound source is located. Experimental comparisons on all three tasks demonstrate the superiority of the proposed model and methods.

The main goal of the audio-driven instrumental animation task is to investigate the relationship between audio signals and visual actions; the task is founded on the strong correlation between sound and motion. Unlike earlier audio-related generation work, this thesis treats audio as a temporal feature that expresses the motion trend of an activity. This thesis proposes ADIA (Audio-Driven Instrumental Animation), a new model that takes a given music clip as the driving source, combines it with an initial pose prediction to generate a sequence of performance actions, and drives the original image to generate a realistic video. Finally, in terms of both objective metrics and subjective evaluation, all experimental results demonstrate the superiority and feasibility of the proposed method.
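To make the role of in-batch negative instances in the matching loss concrete, the following is a minimal sketch of a batch-wise cross-modal matching loss in Python (PyTorch). It assumes L2-normalized audio and image embeddings and a symmetric InfoNCE-style formulation; the exact loss used by LLINet may differ, and the function name and temperature value here are illustrative assumptions rather than the thesis's implementation.

```python
import torch
import torch.nn.functional as F

def batch_matching_loss(audio_emb, image_emb, temperature=0.07):
    """Illustrative batch-wise matching loss (not necessarily LLINet's exact formulation).

    audio_emb, image_emb: (B, D) embeddings of paired audio/image instances.
    Each pair (i, i) is a positive; every other pairing in the batch acts as a negative.
    """
    # L2-normalize so the dot product equals cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # (B, B) similarity matrix: entry (i, j) compares audio i with image j.
    logits = audio_emb @ image_emb.t() / temperature

    # The diagonal holds the ground-truth (positive) pairs.
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Symmetric cross-entropy over both retrieval directions,
    # so the off-diagonal (negative) instances contribute to the gradient.
    loss_a2i = F.cross_entropy(logits, targets)
    loss_i2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2i + loss_i2a) / 2
```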