
Research On Synthetic Speech Deepfake Detection Based On Improved Transformer

Posted on: 2024-02-10    Degree: Master    Type: Thesis
Country: China    Candidate: H Yu    Full Text: PDF
GTID: 2568307112476604    Subject: Electronic information
Abstract/Summary:
Automatic Speaker Verification (ASV) systems are among the most prominent applications to emerge from the development of speech technology. However, ASV systems routinely face threats from different types of deepfake attacks that attempt to gain unauthorized access. With the development of deep learning in particular, deepfake speech has become increasingly realistic and increasingly able to spoof automatic speaker verification devices. To find more effective handcrafted features and to build more robust networks, researchers have proposed many algorithms to detect such malicious attacks. This thesis addresses synthetic speech deepfake detection as follows:

First, this thesis proposes applying the Transformer model to the speech deepfake detection task. Existing approaches fail to take into account the distribution characteristics of real and synthesized speech over Gaussian component scores, and they ignore the long-distance relationships between speech frames. This thesis therefore employs Gaussian probability features as input features to better model the distribution characteristics of the speech data, while exploiting the long-distance modeling capability of the Transformer, which adaptively computes the correlation between each position and all other positions to produce a global representation. This global representation improves detection performance because it lets the system capture the characteristics of the entire input utterance rather than only the information in individual speech frames. The results show that the proposed algorithm achieves an EER of 3.97% and a min t-DCF of 0.2753 on the ASVspoof 2021 LA dataset.

Second, the Transformer model constructs its global representation with self-attention and multilayer perceptrons, which capture complex spatial transformations and long-range feature dependencies. However, the Transformer model tends to ignore local feature details, and the information that characterizes a speaker is largely carried by the relationships between neighboring tokens. Therefore, this
thesis proposes a new network model, Resformer, which combines the advantages of Transformer networks and convolutional neural networks: it further improves the modeling of global features while also being able to capture local ones. The results show that, compared with the Transformer model, the proposed model further improves performance on the ASVspoof 2021 LA dataset, reaching an EER of 2.78% and a min t-DCF of 0.2520.
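The pipeline described above can be sketched in NumPy. This is an illustrative toy, not the thesis's actual architecture: the Gaussian component parameters are random placeholders for a trained GMM, the attention is a single unparameterized head, and `resformer_style_block` is only a hypothetical residual combination of a global (attention) branch and a local (convolution) branch in the spirit of Resformer.

```python
import numpy as np

def gaussian_prob_features(frames, means, variances):
    # Log-likelihood of each frame under each diagonal-covariance Gaussian
    # component: a stand-in for the Gaussian probability features fed to
    # the Transformer (component parameters here are random, not trained).
    diff = frames[:, None, :] - means[None, :, :]                    # (T, K, D)
    return -0.5 * np.sum(diff ** 2 / variances
                         + np.log(2 * np.pi * variances), axis=-1)  # (T, K)

def self_attention(x):
    # Single-head scaled dot-product self-attention: every frame attends to
    # all other frames, yielding the global representation described above.
    scores = x @ x.T / np.sqrt(x.shape[-1])                          # (T, T)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                               # rows sum to 1
    return w @ x                                                     # (T, K)

def local_conv(x, kernel):
    # Depthwise 1-D convolution over time: the CNN branch that captures
    # local detail between neighbouring frames.
    pad = len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.tensordot(kernel, xp[t:t + len(kernel)], axes=(0, 0))
                     for t in range(x.shape[0])])

def resformer_style_block(x, kernel):
    # Hypothetical sketch of the Resformer idea: a residual sum of the
    # global (attention) branch and the local (convolution) branch.
    return x + self_attention(x) + local_conv(x, kernel)

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 20))     # 50 frames of 20-dim acoustic features
means = rng.normal(size=(8, 20))       # 8 Gaussian components (placeholder GMM)
variances = np.ones((8, 20))

feats = gaussian_prob_features(frames, means, variances)             # (50, 8)
out = resformer_style_block(feats, np.array([0.25, 0.5, 0.25]))
print(out.shape)                       # (50, 8)
```

In a real system the Gaussian components would come from a GMM trained on acoustic features, and the attention and convolution branches would carry learned projection weights; the sketch only shows how the local and global branches complement each other.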
Keywords/Search Tags:Synthetic speech deepfake detection, Gaussian probability feature, Transformer, CNN, Resformer