
Research On Methods Of Improving The Representation Ability Of Speaker Recognition Models

Posted on: 2022-12-03
Degree: Master
Type: Thesis
Country: China
Candidate: R T Zhang
Full Text: PDF
GTID: 2558307154474944
Subject: Electronic information
Abstract/Summary:
With the application of speaker recognition in real-world scenarios, traditional speaker recognition systems based on statistical modeling face increasing challenges in complex conditions. Deep neural networks have strong feature representation capabilities and are widely used in speaker recognition systems. The current mainstream speaker recognition networks include time-delay neural networks (TDNNs) and deep residual convolutional neural networks (Res-CNNs). Although TDNNs capture time-series information well, they lack deep characterization of features. Although Res-CNNs can produce high-dimensional speaker representations, they do not consider the complete frequency-dimension information. In addition, dynamic speech features contain multiple types of personalized information, but mainstream networks lack multi-scale speaker representation.

First, this paper proposes the aggregated residual extended time-delay neural network (ARET), based on the "split-transform-aggregate" transformation, to improve the representation ability of the speaker recognition network. The ARET model considers the complete frequency dimension when extracting the speaker representation, and the introduction of residual connections gives the model a strong deep representation of speakers. At the same time, the "split-transform-aggregate" transformation gives ARET a more refined temporal modeling capability and an efficient network structure. We evaluated the proposed model on three large-scale speaker recognition test sets: the VoxCeleb1 test set, VoxCeleb1-E, and VoxCeleb1-H. The results show that the proposed model significantly improves on the baseline systems. This paper then evaluated the proposed model in the 2020 Short-duration Speaker Verification Challenge (held at Interspeech, an important international conference on speech), where it ranked 7th. The results show that the proposed system achieves a 2.7% equal error rate (EER), 70% lower than the official baseline system.

Second, this paper proposes a multi-branch time-delay neural network (Rep-TDNN) based on re-parameterization to improve the multi-scale feature modeling capability of the speaker recognition model. Rep-TDNN adopts a re-parameterization strategy: during training, it uses a multi-branch topology to achieve multi-scale feature modeling; during inference, the model is re-parameterized into a single-branch model, which retains the multi-scale feature capture while increasing inference speed. The model is also evaluated on the VoxCeleb dataset. The results show that it achieves a 1.3% EER on the VoxCeleb1 test set, 20%–30% lower than the best baseline system.

In summary, this paper focuses on improving the representation ability of speaker recognition models and proposes two high-performance models, ARET and Rep-TDNN. This work is conducive to the application of voiceprint recognition in real-world scenarios.
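The "split-transform-aggregate" transformation behind ARET can be illustrated with a minimal sketch. This is not the thesis's actual network, only an assumed ResNeXt-style block in plain numpy: the channel dimension is split into groups, each group is transformed independently, the results are aggregated by concatenation, and a residual connection is added. The function name, group count, and dimensions are all illustrative.

```python
import numpy as np

def split_transform_aggregate(x, weights):
    """Illustrative ResNeXt-style block: split channels into groups,
    transform each group independently (here a linear map + ReLU),
    aggregate by concatenation, then add a residual connection.
    x: (channels, frames); weights: list of (group_dim, group_dim) matrices.
    """
    groups = np.split(x, len(weights), axis=0)            # split
    transformed = [np.maximum(w @ g, 0.0)                 # transform
                   for w, g in zip(weights, groups)]
    aggregated = np.concatenate(transformed, axis=0)      # aggregate
    return aggregated + x                                 # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 100))                        # 64 channels, 100 frames
ws = [rng.standard_normal((16, 16)) for _ in range(4)]    # 4 groups of 16 channels
y = split_transform_aggregate(x, ws)
assert y.shape == x.shape
```

Splitting into many narrow groups is what gives this family of blocks its efficiency: the per-group transforms have far fewer parameters than one dense transform over all channels, while the residual path keeps deep stacks trainable.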
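The re-parameterization idea used by Rep-TDNN can also be sketched. The following is a hypothetical RepVGG-style 1-D example, not the thesis's implementation: a training-time block that sums a 3-tap convolution, a 1-tap (pointwise) convolution, and an identity shortcut is folded into a single 3-tap convolution for inference by merging the 1-tap kernel and the identity into the centre tap. All names and sizes are illustrative.

```python
import numpy as np

def conv1d(x, k):
    """'Same'-padded 1-D convolution. x: (C_in, T), k: (C_out, C_in, K)."""
    C_out, C_in, K = k.shape
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    T = x.shape[1]
    y = np.zeros((C_out, T))
    for t in range(T):
        # Sum over input channels and kernel taps for each frame.
        y[:, t] = np.tensordot(k, xp[:, t:t + K], axes=([1, 2], [0, 1]))
    return y

rng = np.random.default_rng(0)
C, T = 8, 50
x = rng.standard_normal((C, T))
k3 = rng.standard_normal((C, C, 3)) * 0.1    # 3-tap branch
k1 = rng.standard_normal((C, C, 1)) * 0.1    # 1-tap (pointwise) branch

# Training-time multi-branch output: 3-tap + 1-tap + identity shortcut.
y_train = conv1d(x, k3) + conv1d(x, k1) + x

# Inference-time re-parameterization: fold the 1-tap kernel and the
# identity shortcut into the centre tap of a single 3-tap kernel.
k_merged = k3.copy()
k_merged[:, :, 1] += k1[:, :, 0]             # pointwise branch -> centre tap
k_merged[:, :, 1] += np.eye(C)               # identity branch  -> centre tap
y_infer = conv1d(x, k_merged)

assert np.allclose(y_train, y_infer)
```

Because all branches are linear, the merged single-branch model is mathematically identical to the multi-branch one, which is why the multi-scale behaviour learned during training survives the conversion while inference runs at single-branch speed.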
Keywords/Search Tags:Speaker recognition, time-delay neural network, residual transformation, split-transform-merge transformation, multi-scale modeling, re-parameterization