Detection Of Disguised Voice Based On Deep Residual Network

Posted on: 2021-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: M G Zhang
GTID: 2428330602986101
Subject: Electronic and communication engineering
Abstract/Summary:
Disguised voice can attack automatic speaker verification (ASV) systems either by hiding the speaker's identity or by impersonating a target. Among disguising operations, voice transformation (VT) changes a speaker's voice while preserving acoustic naturalness, thereby hiding the speaker's identity, and it is easy to perform with many existing audio editing tools. Recaptured voice is another disguising operation, which attacks ASV by recording a target's voice. Reported work shows that these two operations can deceive today's ASV systems by drastically raising the false rejection rate and the false acceptance rate, respectively, and thus pose challenges to public security. Detecting these two operations is therefore of great significance. This thesis studies VT and recaptured voice detection methods based on deep residual networks, which automatically extract deep features with strong detection capability. The main contributions are as follows:

1. For VT detection, we construct a deep residual convolutional neural network consisting of 16 special residual blocks, each with three layers. The structure can learn deep acoustic features, and no gradient explosion occurs as the number of layers increases, so the network does not suffer from degradation. Three corpora were tested in the experiments. In the intra-database evaluation, all accuracies were above 96.4%. In the cross-database evaluation, the accuracy was above 96.43%. For the minimum disguising factor, i.e. ±4, all accuracy rates were higher than 96.1%. The proposed method outperforms previously reported approaches.

2. For recaptured voice detection, we construct a deep residual network consisting of 15 residual blocks, each with two layers. The network can extract features from very short speech segments. The experiments take into account various factors, including recording equipment, recording distance and recording environment. The results show that accuracy exceeds 99.8% when all data from the different sets of equipment, distances and environments are merged.

The proposed detection methods for VT and recaptured voice can enhance the robustness of ASV, which is of great significance for public security.
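The claim that stacking residual blocks avoids degradation can be illustrated with a small sketch. It is not the thesis's network: it assumes small dense layers in place of convolutions, a hypothetical 4-dimensional input, and all-zero weights chosen only to make the effect visible. Each block computes y = x + F(x), so a stack of 16 three-layer blocks still passes the input through unchanged when F contributes nothing, whereas the same layers without skip connections lose the input entirely.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]

def dense(v, w):
    """Multiply the square matrix w (a list of rows) by the vector v."""
    return [sum(wij * xj for wij, xj in zip(row, v)) for row in w]

def branch(v, weights):
    """F(v): a stack of dense layers with ReLU between them (not after the last)."""
    h = v
    for i, w in enumerate(weights):
        h = dense(h, w)
        if i < len(weights) - 1:
            h = relu(h)
    return h

def residual_block(v, weights):
    """y = v + F(v): the skip connection adds the input back to the branch output."""
    return [a + b for a, b in zip(v, branch(v, weights))]

def residual_forward(v, blocks):
    for weights in blocks:
        v = residual_block(v, weights)
    return v

def plain_forward(v, blocks):
    """Same layers, but without skip connections."""
    for weights in blocks:
        v = branch(v, weights)
    return v

dim = 4                       # hypothetical feature size, for illustration only
x = [0.5, -1.0, 2.0, 0.25]    # hypothetical input vector

# 16 blocks of three layers each, mirroring the block counts quoted for VT detection.
# All-zero weights are an assumption used to isolate the effect of the skip path.
zero_layer = [[0.0] * dim for _ in range(dim)]
blocks = [[zero_layer for _ in range(3)] for _ in range(16)]

residual_out = residual_forward(x, blocks)
plain_out = plain_forward(x, blocks)
print(residual_out)  # the input is preserved through all 16 blocks
print(plain_out)     # the input is lost without skip connections
```

In a trained network the weights are of course not zero; the point of the sketch is only that the identity path gives deeper stacks a baseline that is no worse than a shallower one.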
Keywords/Search Tags: voice transformation, recaptured voice, ASV, residual network, convolution