
Research On Robust Voiceprint Verification Method Based On Deep Learning

Posted on: 2024-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y B Duan
Full Text: PDF
GTID: 2568307049482574
Subject: Engineering
Abstract/Summary:
Voiceprint recognition, also known as speaker recognition (SR), is a biometric recognition technology. Generally, it can be divided into two sub-tasks: speaker identification and speaker verification. Compared with other biometric features, voiceprint features are easy to use, require only simple acquisition equipment, and are difficult to imitate. With the rapid development of deep learning, modeling voiceprint recognition tasks with deep neural networks has become widespread. At the same time, this has exposed a growing number of problems: factors such as differences between speaking environments, differences in a speaker's physical state, and changes in a speaker's mood or age all degrade voiceprint recognition, and mitigating their impact is an urgent problem. How to extract robust voiceprint features with the help of deep learning has therefore become a research hotspot in the field. This paper explores robust voiceprint verification based on deep learning from two aspects, the model framework and data augmentation, and improves verification performance in interview and movie scenes by extracting robust speaker features. The work of this paper can be summarized as follows:

(1) An ECAPA-TDNN model based on the time-delay neural network is constructed, and its performance on voiceprint verification in interview scenes is verified. The performance of the model trained under low-resource conditions is then further explored on the interview-scene and movie-scene test sets.

(2) A dual-model regularization training framework is constructed, which uses different acoustic representations in the middle layers and regularizes the models through self-supervised learning. In addition, the self-supervised loss and the supervised loss are combined with a time-dependent weight to strengthen the correlation between the two models. To make better use of the complementary information in the outputs of the dual model, this paper also studies the complementarity of the scores from the different branches of the dual model, further improving the performance of the voiceprint verification system.

(3) By cutting and mixing acoustic features at different levels, a cut-and-mix data augmentation technique for speech signals is proposed. Combined with other data augmentation methods, this technique greatly improves the robustness of the speaker recognition system.

Experiments use the publicly released VoxCeleb and VoxMovies data sets and report performance on the corresponding VoxCeleb1-O and VoxMovies E-1 to E-5 test sets. Compared with the classical ECAPA-TDNN, the regularized dual model performs significantly better on these test sets. Score fusion between the different branches of the regularized dual model achieves excellent performance, with a relative EER reduction of 10.4% on VoxCeleb1-O and of 9.1% to 11.6% on the cross-domain test sets E-1 to E-5. The cut-and-mix data augmentation technique obtains better performance than other data augmentation methods without increasing the time cost. Combined with noise-based data augmentation, it needs only one third of the training time of a six-method augmentation pipeline while achieving better performance: the EER is reduced by 22.4% on VoxCeleb1-O and by 10% to 16.5% on VoxMovies E-1 to E-5.
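The abstract does not give the exact form of the time-dependent weight that combines the supervised and self-supervised losses in the dual-model framework. As a minimal illustrative sketch only, one plausible choice is a cosine-annealed weight that gradually reduces the influence of the self-supervised regularization term over training; the function name and the annealing schedule below are assumptions, not the thesis's actual formulation:

```python
import math

def combined_loss(sup_loss: float, ssl_loss: float,
                  step: int, total_steps: int) -> float:
    """Combine a supervised loss and a self-supervised regularization
    loss with a time-dependent weight (illustrative sketch).

    The weight alpha decays from 1 at step 0 to 0 at the final step
    following a cosine schedule, so the self-supervised term dominates
    early and the supervised objective dominates late in training.
    """
    alpha = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return sup_loss + alpha * ssl_loss
```

Any monotone schedule (linear, exponential) would serve the same purpose; the key idea from the abstract is only that the relative weight of the two losses changes over time to keep the two models correlated.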
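The cut-and-mix augmentation described in contribution (3), pasting a contiguous block of one utterance's acoustic features into another's, can be sketched on spectrogram-like feature matrices as follows. The function name, the [freq_bins, frames] layout, and the frame-level granularity are illustrative assumptions, since the abstract does not specify the exact procedure:

```python
import numpy as np

def cut_and_mix(spec_a: np.ndarray, spec_b: np.ndarray,
                num_frames: int, rng=None) -> np.ndarray:
    """Cut a contiguous block of time frames from spec_b and paste it
    into a copy of spec_a at a random position (illustrative sketch).

    Both inputs are feature matrices of shape [freq_bins, frames],
    e.g. log-mel spectrograms of two different utterances.
    """
    rng = rng or np.random.default_rng()
    assert spec_a.shape == spec_b.shape, "features must share a shape"
    total_frames = spec_a.shape[1]
    start = int(rng.integers(0, total_frames - num_frames + 1))
    mixed = spec_a.copy()  # leave the original untouched
    mixed[:, start:start + num_frames] = spec_b[:, start:start + num_frames]
    return mixed
```

In the same spirit as image-domain CutMix, the label or training loss would typically also be mixed in proportion to the pasted region; that bookkeeping is omitted here.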
Keywords/Search Tags:Deep learning, Speaker recognition, ECAPA-TDNN, Regularization, Data Augmentation