Audio-visual Data Recognition Based On Adversarial-metric Learning And Attribute Guidance Learning

Posted on: 2022-06-20
Degree: Master
Type: Thesis
Country: China
Candidate: M L Hu
Full Text: PDF
GTID: 2518306542463614
Subject: Computer technology
Abstract/Summary:
Audio-visual data recognition aims to match identities across audio clips and facial images: given a facial image, the goal is to find the audio clip that belongs to the same person, or vice versa. This technology could provide substantial help for information retrieval and criminal investigation in the future. At present, the main challenges of this task include noisy audio clips, low-resolution images, and the natural gap between different modalities. In the last few years, researchers have proposed various methods in response to these challenges, which mainly concentrate on learning discriminative feature representations. However, the results of audio-visual data recognition are still far from meeting the requirements of practical applications. To address these challenges and carry out more in-depth research, we focus on closing the modality gap between audio clips and facial images. The contributions are as follows:

(1) Considering the natural heterogeneous gap between audio clips and face images, we propose a novel adversarial learning framework. Adversarial learning aims to generate a modality-independent feature representation for each person in each modality. In addition, considering that the feature representations of the same identity should be more compact, we propose to utilize metric learning to learn a robust similarity metric for audio-visual data recognition. By integrating modality-independent representation learning and robust metric learning into an end-to-end trainable network, our method can overcome the heterogeneity between the audio and image modalities and achieves considerable performance.

(2) Considering the heterogeneous gap between audio clips and face images, we propose to utilize high-level semantic attribute information to shrink the cross-modality gap. By constraining the facial image and the audio clips to be consistent in their public attributes, we first pull the data of different modalities closer in the public attribute space, which can alleviate the gap between the cross-modal data. In addition, considering the similarity between samples of the same identity, we propose to leverage the private attributes of each identity to increase intra-class consistency. By incorporating private attributes into the public attribute learning framework, the proposed method can narrow the gap between the same attributes in different modalities while maintaining intra-class consistency. Comprehensive experimental results demonstrate the improvement of the proposed method for audio-visual data recognition.
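The two ingredients named in contribution (1), a metric-learning objective and an adversarial modality objective, can be illustrated with a minimal sketch. The abstract does not state which losses the thesis uses, so everything here is an assumption made for illustration: a triplet margin loss stands in for the metric-learning term, a binary cross-entropy modality discriminator stands in for the adversarial term, and the names `triplet_loss`, `discriminator_loss`, `encoder_loss`, `margin`, and `adv_weight` are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Assumed metric-learning term: hinge on d(a, p) - d(a, n) + margin.

    anchor and positive share an identity (e.g. a voice embedding and the
    matching face embedding); negative comes from a different identity.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def discriminator_loss(p_is_audio, is_audio):
    """Assumed adversarial term: binary cross-entropy of a modality
    discriminator that outputs the probability an embedding is audio."""
    eps = 1e-12
    p = min(max(p_is_audio, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if is_audio else -math.log(1.0 - p)


def encoder_loss(anchor, positive, negative, p_is_audio, is_audio,
                 adv_weight=0.1):
    """Encoder objective in this sketch: lower the metric loss while raising
    the discriminator's loss, so embeddings carry less modality information."""
    metric = triplet_loss(anchor, positive, negative)
    adversarial = discriminator_loss(p_is_audio, is_audio)
    return metric - adv_weight * adversarial


# A matched voice/face pair that is already closer than the mismatched face
voice, same_face, other_face = [0.0, 0.0], [0.0, 0.0], [3.0, 4.0]
print(triplet_loss(voice, same_face, other_face))   # no violation of the margin
print(triplet_loss(voice, other_face, same_face))   # roles swapped: large loss
print(discriminator_loss(0.5, True))                # discriminator at chance
```

In this sketch the discriminator would be trained to minimise `discriminator_loss`, while the audio and face encoders minimise `encoder_loss`; a discriminator stuck at chance (probability 0.5) is the situation in which the embeddings reveal nothing about their modality.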
Keywords/Search Tags: Audio-visual data recognition, Adversarial learning, Metric learning, Attributes