Research On Speaker Recognition Based On X-vector

Posted on: 2020-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: G D Cai
Full Text: PDF
GTID: 2428330575494933
Subject: Electronic and communication engineering
Abstract/Summary:
To improve the performance of speaker recognition systems, this thesis builds on the x-vector system, combines it with a convolutional neural network (CNN) and an attention mechanism, and explores effective solutions to the weaknesses of the x-vector system.
(1) Acoustic features are extracted by a convolutional neural network. Mel-frequency cepstral coefficients (MFCC) are commonly used as input features in speech technology, but this empirically designed feature has some problems. This thesis instead uses the spectrogram, the most primitive acoustic representation, as the input feature, because it retains more of the original speaker information. Through the local perception and weight sharing of the CNN, the spectrogram is automatically optimized and reduced in dimensionality, which avoids the information loss caused by computing empirical features.
(2) The attention mechanism is applied to the computation of the statistics layer. The x-vector statistics layer computes the mean and standard deviation of the frame-level features directly, which implicitly assumes that every frame is equally important; this assumption is unreasonable. This thesis introduces the attention mechanism to address the problem and adopts two schemes. The first introduces an attention layer that enhances the information in key frames and the internal correlations of the speech signal, and uses multi-head attention to capture different dependencies within the sequence. The second establishes an attention-based statistics layer that directly modifies the statistics computation: it calculates an attention-weighted mean and standard deviation and is combined with multi-head attention.
(3) Experiments were performed on the VoxCeleb1 dataset using the Kaldi speech toolkit. The analysis mainly compares the impact of different acoustic features and different network structures on system performance. The results show that, relative to the x-vector baseline system, the spectrogram combined with the CNN reduces the equal error rate (EER) by 6.5%, the attention-layer scheme reduces it by 13.5%, and the attention-based statistics-layer scheme reduces it by 25.5%. These results indicate that the proposed approach is reasonable and effective: features are extracted and optimized directly from the spectrogram by the CNN, and the computation of the x-vector statistics layer is improved by the attention mechanism.
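The attention-weighted mean and standard deviation used by the attention-based statistics layer can be illustrated with a minimal NumPy sketch. The abstract does not give the scoring function, so the single-head `tanh` scorer, the parameter names `W`, `b`, `v`, and the toy dimensions below are assumptions chosen for illustration, not the thesis's actual configuration. A multi-head variant would presumably repeat the same computation with separate parameters per head and concatenate the results.

```python
import numpy as np

def softmax(x, axis=0):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_stats_pooling(frames, W, b, v):
    """Pool frame-level features of shape (T, D) into one (2*D,) vector.

    W (D, H), b (H,), v (H,) are hypothetical attention parameters;
    in a real system they would be learned with the rest of the network.
    """
    scores = np.tanh(frames @ W + b) @ v        # one scalar score per frame, (T,)
    alpha = softmax(scores, axis=0)             # frame weights, sum to 1
    mean = alpha @ frames                       # weighted mean, (D,)
    var = alpha @ (frames ** 2) - mean ** 2     # weighted variance, (D,)
    std = np.sqrt(np.clip(var, 1e-12, None))    # clip guards against tiny negatives
    return np.concatenate([mean, std]), alpha

# Toy input: 50 frames of 8-dimensional features (random, for illustration only).
rng = np.random.default_rng(0)
T, D, H = 50, 8, 4
frames = rng.normal(size=(T, D))
W = rng.normal(size=(D, H))
b = np.zeros(H)
v = rng.normal(size=H)

pooled, alpha = attentive_stats_pooling(frames, W, b, v)

# With v = 0 every frame gets the same score, so the weights become uniform
# and the layer reduces to plain mean / standard deviation pooling.
pooled_uniform, alpha_uniform = attentive_stats_pooling(frames, W, b, np.zeros(H))
```

When all frame weights are equal, the sketch falls back to the unweighted mean and standard deviation, which is the equal-importance assumption that scheme (2) is meant to relax.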
Keywords/Search Tags: Spectrogram, Speaker Recognition, Attention Mechanism, CNN