
Research On Methods Of Improving The Representation Ability Of Speaker Recognition Models

Posted on: 2022-12-03
Degree: Master
Type: Thesis
Country: China
Candidate: R T Zhang
Full Text: PDF
GTID: 2558307154474944
Subject: Electronic information
Abstract/Summary:
With the application of speaker recognition in real-world scenarios, traditional speaker recognition systems based on statistical modeling face increasing challenges in complex conditions. Deep neural networks have strong feature representation capabilities and are widely used in speaker recognition systems. The current mainstream speaker recognition networks include time-delay neural networks (TDNNs) and deep residual convolutional neural networks (Res-CNNs). Although TDNNs capture time-series information well, they lack deep characterization of features. Although Res-CNNs can produce high-dimensional speaker representations, they do not consider the complete frequency-dimension information. In addition, dynamic speech features contain multiple types of personalized information, but mainstream networks lack multi-scale speaker representation.

First, this paper proposes the aggregated residual extended time-delay neural network (ARET), based on the "split-transform-aggregate" transformation, to improve the representation ability of the speaker recognition network. The ARET model considers the complete frequency dimension when extracting the speaker representation, and the introduction of residual connections gives the model a strong deep representation of speakers. At the same time, the "split-transform-aggregate" transformation gives ARET a more refined temporal modeling capability and an efficient network structure. We evaluated the proposed model on three large-scale speaker recognition test sets: the VoxCeleb1 test set, VoxCeleb1-E, and VoxCeleb1-H. The results show that the proposed model significantly improves on the baseline systems. This paper then evaluated the proposed model in the 2020 Short-duration Speaker Verification Challenge (held at Interspeech, an important international conference on speech), where it ranked 7th. The results show that the proposed system achieves a 2.7% equal error rate (EER), 70% lower than the official baseline system.

Second, this paper proposes a multi-branch time-delay neural network (Rep-TDNN) based on re-parameterization to improve the multi-scale feature modeling capability of the speaker recognition model. Rep-TDNN adopts a re-parameterization strategy: during training, it uses a multi-branch topology to achieve multi-scale feature modeling; during inference, the model is re-parameterized into a single-branch model, which retains the multi-scale feature capture while increasing inference speed. The model is also evaluated on the VoxCeleb dataset. The results show that it achieves a 1.3% EER on the VoxCeleb1 test set, 20%–30% lower than the best baseline system.

In summary, this paper focuses on improving the representation ability of speaker recognition models and proposes two high-performance models, ARET and Rep-TDNN. This work is conducive to the application of voiceprint recognition in real-world scenarios.
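The "split-transform-aggregate" transformation behind ARET can be illustrated with a minimal sketch. This is not the thesis's actual network, only an assumed ResNeXt-style block in plain numpy: the channel dimension is split into groups, each group is transformed independently, the results are aggregated by concatenation, and a residual connection is added. The function name, group count, and dimensions are all illustrative.

```python
import numpy as np

def split_transform_aggregate(x, weights):
    """Illustrative ResNeXt-style block: split channels into groups,
    transform each group independently (here a linear map + ReLU),
    aggregate by concatenation, then add a residual connection.
    x: (channels, frames); weights: list of (group_dim, group_dim) matrices.
    """
    groups = np.split(x, len(weights), axis=0)            # split
    transformed = [np.maximum(w @ g, 0.0)                 # transform
                   for w, g in zip(weights, groups)]
    aggregated = np.concatenate(transformed, axis=0)      # aggregate
    return aggregated + x                                 # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 100))                        # 64 channels, 100 frames
ws = [rng.standard_normal((16, 16)) for _ in range(4)]    # 4 groups of 16 channels
y = split_transform_aggregate(x, ws)
assert y.shape == x.shape
```

Splitting into many narrow groups is what gives this family of blocks its efficiency: the per-group transforms have far fewer parameters than one dense transform over all channels, while the residual path keeps deep stacks trainable.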
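The re-parameterization idea used by Rep-TDNN can also be sketched. The following is a hypothetical RepVGG-style 1-D example, not the thesis's implementation: a training-time block that sums a 3-tap convolution, a 1-tap (pointwise) convolution, and an identity shortcut is folded into a single 3-tap convolution for inference by merging the 1-tap kernel and the identity into the centre tap. All names and sizes are illustrative.

```python
import numpy as np

def conv1d(x, k):
    """'Same'-padded 1-D convolution. x: (C_in, T), k: (C_out, C_in, K)."""
    C_out, C_in, K = k.shape
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    T = x.shape[1]
    y = np.zeros((C_out, T))
    for t in range(T):
        # Sum over input channels and kernel taps for each frame.
        y[:, t] = np.tensordot(k, xp[:, t:t + K], axes=([1, 2], [0, 1]))
    return y

rng = np.random.default_rng(0)
C, T = 8, 50
x = rng.standard_normal((C, T))
k3 = rng.standard_normal((C, C, 3)) * 0.1    # 3-tap branch
k1 = rng.standard_normal((C, C, 1)) * 0.1    # 1-tap (pointwise) branch

# Training-time multi-branch output: 3-tap + 1-tap + identity shortcut.
y_train = conv1d(x, k3) + conv1d(x, k1) + x

# Inference-time re-parameterization: fold the 1-tap kernel and the
# identity shortcut into the centre tap of a single 3-tap kernel.
k_merged = k3.copy()
k_merged[:, :, 1] += k1[:, :, 0]             # pointwise branch -> centre tap
k_merged[:, :, 1] += np.eye(C)               # identity branch  -> centre tap
y_infer = conv1d(x, k_merged)

assert np.allclose(y_train, y_infer)
```

Because all branches are linear, the merged single-branch model is mathematically identical to the multi-branch one, which is why the multi-scale behaviour learned during training survives the conversion while inference runs at single-branch speed.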
Keywords/Search Tags:Speaker recognition, time-delay neural network, residual transformation, split-transform-merge transformation, multi-scale modeling, re-parameterization