With the rapid development of deep learning and multimedia technologies, multi-modal and cross-modal learning, which focus on the relationships between different modalities, have become active research areas in recent years. Because the data structure of each modality differs, determining how to correlate different modalities is one of the primary challenges in this type of task. Audio and image, as the two most common modalities in daily life, offer a wealth of readily available data for model training and evaluation. This thesis therefore proposes two new tasks to further investigate the possibilities of the audio-image field: audio-image cross-modal zero-shot learning and audio-driven instrumental animation.

The audio-image cross-modal zero-shot learning task is motivated by the human ability to infer, from prior knowledge, the sound that an unseen object would emit. To investigate the viability of zero-shot inference between audio and image, this thesis derives three sub-tasks from traditional tasks: zero-shot cross-modal retrieval, zero-shot sound source localization, and audio-based zero-shot recognition. To address these three tasks, a unified model, LLINet (Look, Listen and Infer), is proposed, and a new dataset, INSTRUMENT-32 CLASS, is collected for model training and evaluation. For the retrieval task, this thesis presents a matching loss that exploits the contribution of negative instances within a batch during optimization. For the localization task, an attention module correlates a given audio clip with the image regions where the sound source is located. Experimental comparisons on all three tasks demonstrate the superiority of the proposed model and methods.

The main goal of the audio-driven instrumental animation task is to investigate the relationship between audio signals and visual actions; the task is founded on the strong correlation between sound and motion. Unlike earlier audio-related generation work, this thesis treats audio as a temporal feature that expresses the motion trend of an activity. This thesis proposes ADIA (Audio-Driven Instrumental Animation), a new model that takes a given music clip as the driving source, combines it with an initial pose prediction to generate a sequence of performance actions, and drives the original image to generate a realistic video. Finally, in terms of both objective metrics and subjective evaluation, all experimental results demonstrate the superiority and feasibility of the proposed method.
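To make the role of in-batch negative instances in the matching loss concrete, the following is a minimal sketch of a batch-wise cross-modal matching loss in Python (PyTorch). It assumes L2-normalized audio and image embeddings and a symmetric InfoNCE-style formulation; the exact loss used by LLINet may differ, and the function name and temperature value here are illustrative assumptions rather than the thesis's implementation.

```python
import torch
import torch.nn.functional as F

def batch_matching_loss(audio_emb, image_emb, temperature=0.07):
    """Illustrative batch-wise matching loss (not necessarily LLINet's exact formulation).

    audio_emb, image_emb: (B, D) embeddings of paired audio/image instances.
    Each pair (i, i) is a positive; every other pairing in the batch acts as a negative.
    """
    # L2-normalize so the dot product equals cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # (B, B) similarity matrix: entry (i, j) compares audio i with image j.
    logits = audio_emb @ image_emb.t() / temperature

    # The diagonal holds the ground-truth (positive) pairs.
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Symmetric cross-entropy over both retrieval directions,
    # so the off-diagonal (negative) instances contribute to the gradient.
    loss_a2i = F.cross_entropy(logits, targets)
    loss_i2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2i + loss_i2a) / 2
```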