
Research On Synthetic Speech Deepfake Detection Based On Improved Transformer

Posted on: 2024-02-10    Degree: Master    Type: Thesis
Country: China    Candidate: H Yu    Full Text: PDF
GTID: 2568307112476604    Subject: Electronic information
Abstract/Summary:
Automatic Speaker Verification (ASV) systems are among the most prominent applications to emerge from the development of speech technology. However, ASV systems routinely face threats from different types of deepfake attacks that attempt to gain unauthorized access. With the development of deep learning in particular, deepfake speech has become increasingly realistic and increasingly able to spoof automatic speaker verification devices. To find more effective handcrafted features and to build more robust networks, researchers have proposed many algorithms to detect such malicious attacks. This thesis addresses synthetic speech deepfake detection as follows:

First, this thesis proposes applying the Transformer model to the speech deepfake detection task. Existing approaches fail to take into account the distribution characteristics of real and synthesized speech over Gaussian component scores, and they ignore the long-distance relationships between speech frames. This thesis therefore employs Gaussian probability features as input features to better model the distribution characteristics of the speech data, while exploiting the long-distance modeling capability of the Transformer, which adaptively computes the correlation between each position and all other positions to produce a global representation. This global representation improves detection performance because it lets the system capture the characteristics of the entire input utterance rather than only the information in individual speech frames. The results show that the proposed algorithm achieves an EER of 3.97% and a min t-DCF of 0.2753 on the ASVspoof 2021 LA dataset.

Second, the Transformer model constructs its global representation with self-attention and multilayer perceptrons, which capture complex spatial transformations and long-range feature dependencies. However, the Transformer model tends to ignore local feature details, and the information that characterizes a speaker is largely carried by the relationships between neighboring tokens. Therefore, this
thesis proposes a new network model, Resformer, which combines the advantages of Transformer networks and convolutional neural networks: it further improves the modeling of global features while also being able to capture local ones. The results show that, compared with the Transformer model, the proposed model further improves performance on the ASVspoof 2021 LA dataset, reaching an EER of 2.78% and a min t-DCF of 0.2520.
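The pipeline described above can be sketched in NumPy. This is an illustrative toy, not the thesis's actual architecture: the Gaussian component parameters are random placeholders for a trained GMM, the attention is a single unparameterized head, and `resformer_style_block` is only a hypothetical residual combination of a global (attention) branch and a local (convolution) branch in the spirit of Resformer.

```python
import numpy as np

def gaussian_prob_features(frames, means, variances):
    # Log-likelihood of each frame under each diagonal-covariance Gaussian
    # component: a stand-in for the Gaussian probability features fed to
    # the Transformer (component parameters here are random, not trained).
    diff = frames[:, None, :] - means[None, :, :]                    # (T, K, D)
    return -0.5 * np.sum(diff ** 2 / variances
                         + np.log(2 * np.pi * variances), axis=-1)  # (T, K)

def self_attention(x):
    # Single-head scaled dot-product self-attention: every frame attends to
    # all other frames, yielding the global representation described above.
    scores = x @ x.T / np.sqrt(x.shape[-1])                          # (T, T)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                               # rows sum to 1
    return w @ x                                                     # (T, K)

def local_conv(x, kernel):
    # Depthwise 1-D convolution over time: the CNN branch that captures
    # local detail between neighbouring frames.
    pad = len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.tensordot(kernel, xp[t:t + len(kernel)], axes=(0, 0))
                     for t in range(x.shape[0])])

def resformer_style_block(x, kernel):
    # Hypothetical sketch of the Resformer idea: a residual sum of the
    # global (attention) branch and the local (convolution) branch.
    return x + self_attention(x) + local_conv(x, kernel)

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 20))     # 50 frames of 20-dim acoustic features
means = rng.normal(size=(8, 20))       # 8 Gaussian components (placeholder GMM)
variances = np.ones((8, 20))

feats = gaussian_prob_features(frames, means, variances)             # (50, 8)
out = resformer_style_block(feats, np.array([0.25, 0.5, 0.25]))
print(out.shape)                       # (50, 8)
```

In a real system the Gaussian components would come from a GMM trained on acoustic features, and the attention and convolution branches would carry learned projection weights; the sketch only shows how the local and global branches complement each other.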
Keywords/Search Tags:Synthetic speech deepfake detection, Gaussian probability feature, Transformer, CNN, Resformer