
Research On Domain Adaptation Methods In Speaker Recognition

Posted on: 2024-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: H R Hu
Full Text: PDF
GTID: 2568306932455814
Subject: Information and Communication Engineering
Abstract/Summary:
Speaker recognition, also known as voiceprint recognition, is a biometric technology that uses speech features for identity authentication. Its core problem is to extract from the speech signal representative vectors that carry discriminative information about the speaker while suppressing irrelevant factors such as speech content, sampling channel, environmental noise, and vocalization style, so as to perform speaker identification or verification. Speaker recognition draws on several research fields, including speech signal processing, pattern recognition, and machine learning, and has extensive applications in information security, personalized services, and beyond.

In recent years, with the continued development and maturation of deep learning, deep neural networks such as TDNNs and ResNets have become the mainstream approach in speaker verification thanks to their powerful representation-learning capabilities, and they achieve satisfactory performance on benchmark datasets under constrained conditions. In complex real-world scenarios, however, the performance of speaker verification systems often degrades significantly: factors such as channel conditions, dialects, and speaking styles cause a mismatch between training and testing data that violates the assumption of independent and identically distributed samples. This issue, known as domain shift, arises from the distribution mismatch between the source domain and the target domain. Because collecting labeled data that adequately covers every application environment is extremely difficult, rapidly adapting a speaker verification model trained on a large-scale source domain to a new target domain using only limited weakly labeled or even unlabeled samples has become an urgent problem in practical applications, with significant research value for the field of speaker verification. To address this problem, this
dissertation investigates domain adaptation methods for speaker recognition systems from three angles: the structural framework, the optimization objective, and the data.

(1) Structural framework. Existing domain adaptation methods often target only the backend or the deep layers of the model, failing to address domain shift in the shallow layers, and they rely heavily on domain and speaker labels. To overcome these limitations, this dissertation proposes two domain-robust modules that make the intermediate features of deep networks more robust to domain shift: a Domain-Aware Batch Normalization (DABN) module, which alleviates the feature-distribution gap between the source and target domains, and a Domain-Agnostic Instance Normalization (DAIN) module, which addresses mismatch among different unknown sub-domains. In addition, the dissertation introduces a self-supervised cross-domain joint training framework based on Smoothed Knowledge Distillation (SKD) to better exploit the latent category information in the target domain.

(2) Optimization objective. Existing methods often fail to fully leverage the prior knowledge of the large-scale source domain, which limits the adaptation effect. This dissertation therefore decouples the traditional global distribution-alignment strategy into intra-class and inter-class distribution alignment, so that the well-estimated class-level knowledge of the source domain is transferred to the target domain more completely and at a finer granularity, rather than using target-domain data directly for metric learning or global distribution adjustment.

(3) Data. Current domain adaptation methods struggle to ensure cross-domain consistency of features belonging to the same class, because corpora containing the same speaker across domains are difficult to collect; as a result, the same speaker may be misclassified into different categories in different environments. To address this issue, the dissertation proposes a cross-domain data augmentation method based on StarGAN, inspired by voice conversion techniques. The method explicitly models the cross-domain distribution differences of a class using single-speaker multi-domain samples and learns to convert speech samples to a target domain while preserving speaker information. It then performs cross-domain augmentation on all data, increasing intra-class diversity and facilitating the learning of a domain-invariant speaker representation space.
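As background, the verification task described at the start of the abstract ultimately reduces to scoring a trial between two extracted speaker embeddings and comparing the score to a tuned threshold. A minimal sketch (the names `cosine_score` and `same_speaker` and the threshold value are illustrative, not from the thesis):

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

def same_speaker(emb_a: np.ndarray, emb_b: np.ndarray,
                 threshold: float = 0.5) -> bool:
    """Accept the verification trial when the score exceeds the threshold.
    In practice the threshold is tuned on a development set."""
    return cosine_score(emb_a, emb_b) >= threshold
```

Domain shift matters precisely because it perturbs these scores: embeddings of the same speaker recorded over different channels drift apart, pushing genuine trials below the threshold.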
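The DABN idea in contribution (1) can be illustrated as a batch-normalization layer that keeps one set of running statistics per domain while sharing the affine parameters across domains. The following numpy sketch is an assumption about the general technique, not the thesis's exact implementation (class name, momentum value, and update rule are illustrative):

```python
import numpy as np

class DomainAwareBatchNorm:
    """Sketch of a domain-aware batch-norm layer: per-domain running
    statistics, affine parameters shared by all domains."""

    def __init__(self, num_features: int, num_domains: int, eps: float = 1e-5):
        self.eps = eps
        # One set of running statistics per domain
        self.mean = np.zeros((num_domains, num_features))
        self.var = np.ones((num_domains, num_features))
        # Affine parameters shared across domains
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)

    def forward(self, x: np.ndarray, domain: int) -> np.ndarray:
        """x: (batch, num_features); normalize with this batch's statistics
        and update only the running statistics of `domain`."""
        mu, var = x.mean(axis=0), x.var(axis=0)
        self.mean[domain] = 0.9 * self.mean[domain] + 0.1 * mu
        self.var[domain] = 0.9 * self.var[domain] + 0.1 * var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

Because each domain is normalized with its own statistics, a feature-distribution offset between source and target batches is removed before the shared affine transform, which is the sense in which the module alleviates the cross-domain gap.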
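A smoothed knowledge-distillation loss of the kind named in contribution (1) can be sketched as a KL divergence between temperature-softened, label-smoothed teacher posteriors and the student's predictions. The function name, temperature, and smoothing factor below are illustrative assumptions, not the thesis's exact formulation:

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax, numerically stabilized."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def smoothed_kd_loss(student_logits: np.ndarray, teacher_logits: np.ndarray,
                     T: float = 2.0, smooth: float = 0.1) -> float:
    """KL(teacher || student) with the teacher posteriors first softened
    by temperature T and then smoothed toward the uniform distribution,
    which damps overconfident pseudo-labels on the target domain."""
    p_t = softmax(teacher_logits, T)
    K = p_t.shape[-1]
    p_t = (1.0 - smooth) * p_t + smooth / K   # label smoothing on the teacher
    p_s = softmax(student_logits, T)
    return float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)))
```

Smoothing the teacher keeps the distilled targets from collapsing onto a single (possibly wrong) pseudo-class, which is useful when the teacher's labels on the target domain are themselves uncertain.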
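The decoupling in contribution (2) can be illustrated by splitting global alignment into two terms: an intra-class term that pulls same-class means of the two domains together, and an inter-class term that matches the geometry among class means. This sketch assumes (pseudo-)labels are available on the target domain; the function names and the specific distance choices are illustrative, not the thesis's exact objective:

```python
import numpy as np

def class_means(X: np.ndarray, y: np.ndarray, classes: np.ndarray) -> np.ndarray:
    """Per-class mean embeddings, stacked in class order."""
    return np.stack([X[y == c].mean(axis=0) for c in classes])

def decoupled_alignment_loss(Xs, ys, Xt, yt):
    """Return (intra, inter): intra-class alignment penalizes the gap
    between same-class means across domains; inter-class alignment
    penalizes mismatch in the pairwise class-mean distance structure."""
    classes = np.intersect1d(np.unique(ys), np.unique(yt))
    Ms = class_means(Xs, ys, classes)
    Mt = class_means(Xt, yt, classes)
    intra = float(np.mean(np.sum((Ms - Mt) ** 2, axis=1)))
    # Pairwise class-mean distance matrices within each domain
    Ds = np.linalg.norm(Ms[:, None] - Ms[None, :], axis=-1)
    Dt = np.linalg.norm(Mt[:, None] - Mt[None, :], axis=-1)
    inter = float(np.mean((Ds - Dt) ** 2))
    return intra, inter
```

The two terms really are decoupled: a rigid translation of the target domain leaves the inter-class term at zero while the intra-class term grows, so each term transfers a different piece of the source domain's class-level structure.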
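The augmentation loop of contribution (3) can be sketched independently of the StarGAN model itself: given a trained many-to-many conversion model, every utterance is converted into every other domain while keeping its speaker label. The `generator(x, target_domain)` callable below is a hypothetical stand-in for that trained model; the loop structure, not the model, is what this sketch shows:

```python
import numpy as np

def augment_cross_domain(utterances, domains, generator, num_domains):
    """StarGAN-style cross-domain augmentation sketch: convert each
    utterance into every domain other than its own, keeping the speaker
    identity (and hence the class label) unchanged."""
    aug_x, aug_d = [], []
    for x, d in zip(utterances, domains):
        for target in range(num_domains):
            if target == d:
                continue  # no need to convert into the utterance's own domain
            aug_x.append(generator(x, target))
            aug_d.append(target)
    return aug_x, aug_d
```

After augmentation, every speaker is represented in every domain, which is what lets a subsequently trained embedding network see same-class samples across domains and learn a domain-invariant representation space.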
Keywords/Search Tags:Speaker recognition, Domain adaptation, Knowledge distillation, Distribution alignment, Voice conversion, Adversarial generation