Research On Anti-spoofing Speaker Verification Method Based On Pyramid Pooling

Posted on: 2024-09-12 | Degree: Master | Type: Thesis
Country: China | Candidate: Z K Wan | Full Text: PDF
GTID: 2568307130953409 | Subject: Control Science and Engineering
Abstract/Summary:
Speaker verification is a biometric technique that uses speech signals to determine the identity of a speaker. In practical identity verification scenarios, fraudsters can use techniques such as voice conversion and speech synthesis to create fraudulent speech signals that closely resemble authentic ones. Current speaker verification methods struggle to detect the subtle differences between fraudulent and authentic speech, so fraudulent signals can easily pass through speaker verification models, which poses a serious security threat. Spoofing countermeasures can be used to distinguish genuine speech from fake speech. However, standalone spoofing countermeasures focus only on the fraud detection task and ignore the subsequent speaker verification process. Anti-spoofing speaker verification, which aims to protect speaker verification from fraud while preserving its ability to identify speakers, has therefore received increasing attention.

Because spoofing countermeasures and speaker verification are correlated, the performance of anti-spoofing speaker verification methods depends on both components, and each still has issues that lead to suboptimal performance. For speaker verification, convolutional operations greatly improve performance, but most existing algorithms rarely combine global and local prior information and suffer from the information loss caused by the limited local receptive field of convolutional operations. For spoofing countermeasures, using deep learning to classify fraudulent and authentic speech signals increases detection accuracy, but most existing deep classification methods rely on a classification loss alone to optimize the model, ignoring the similarities and differences between samples, which harms the learning of the representation distribution. To address these challenges, this thesis proposes a speaker verification method based on statistical pyramid pooling and a spoofing countermeasure method based on supervised contrastive learning. The main contributions are summarized as follows:

(1) A speaker verification method based on statistical pyramid pooling is proposed. To address the loss of global information caused by the insufficient receptive field of convolutional layers, a Statistical Pyramid Pooling Dense Time Delay Neural Network (SPD-TDNN) model is proposed. SPD-TDNN uses a statistical pyramid pooling module, consisting of a global context branch and a local context branch, to extract multi-scale contextual prior information. The global context branch uses the global mean and standard deviation over the time domain to extract global prior information, while the local context branch uses mean pooling layers of different sizes to extract multi-scale local prior information. Experiments on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed SPD-TDNN outperforms the state-of-the-art speaker verification models D-TDNN, D-TDNN-SS, and ECAPA-TDNN, reducing the EER, MinDCF(0.01), and MinDCF(0.001) metrics by 0.45%, 0.05, and 0.05, respectively, compared with the baseline model D-TDNN. Experiments on the VoxCeleb2 and ASVspoof2019 datasets show that the anti-spoofing speaker verification method using SPD-TDNN outperforms those using ECAPA-TDNN and D-TDNN-SS, achieving the best result on the SASV-EER metric.

(2) A spoofing countermeasure method based on supervised contrastive learning is proposed. To address the suboptimal feature representation caused by the classification loss ignoring the similarity and dissimilarity between samples, the Momentum Contrast Light Convolutional Neural Network (MC-LCNN) model is proposed. Specifically, MC-LCNN jointly optimizes a supervised contrastive loss and a classification loss. The contrastive loss computes the similarity between positive and negative sample pairs, which helps the model obtain a better feature distribution and aids in detecting spoofed speech generated by unknown attack methods. The use of storage blocks decouples the number of positive and negative sample pairs from the batch size of the current training samples. This provides more sample pairs for supervised contrastive learning and allows the model to learn features that distinguish real from spoofed speech more easily. Extensive experiments on the ASVspoof2019 dataset show that the proposed method outperforms the baseline countermeasure model LCNN, reducing the EER by 4.28% and the min t-DCF by 0.09. On the VoxCeleb2 and ASVspoof2019 datasets, the anti-spoofing speaker verification method using MC-LCNN outperforms the two baseline anti-spoofing speaker verification models of the SASV2022 Challenge by 0.43% and 5.09%, respectively, on the SASV-EER metric, and outperforms the anti-spoofing speaker verification method using the baseline countermeasure model by 1.40% on the same metric. This demonstrates that improving countermeasure performance effectively enhances anti-spoofing speaker verification performance.

(3) A prototype anti-spoofing speaker verification system is designed and implemented. Based on the above research results, the system implements the algorithms in Python with the PyTorch deep learning framework and uses MATLAB for the graphical user interface. The prototype can be used to demonstrate and verify the effectiveness, practicality, and robustness of the proposed methods.
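
The two pooling branches described in contribution (1) can be illustrated with a small sketch. This is not the SPD-TDNN implementation: it only shows, in plain Python, a global branch that computes the per-channel mean and standard deviation over time and a local branch that applies mean pooling at several window sizes. The function names, the window sizes (2 and 4), the non-overlapping windows, and the use of the population standard deviation are all assumptions made for illustration, and how the real model fuses the two branches is not specified in the abstract.

```python
import math


def global_stats(frames):
    """Global branch (illustrative): per-channel mean and standard
    deviation over the whole time axis. `frames` is a list of T frames,
    each a list of C channel values."""
    n = len(frames)
    channels = len(frames[0])
    means = [sum(f[c] for f in frames) / n for c in range(channels)]
    # Population standard deviation (divide by n); an assumption here.
    stds = [
        math.sqrt(sum((f[c] - means[c]) ** 2 for f in frames) / n)
        for c in range(channels)
    ]
    return means, stds


def local_mean_pool(frames, window):
    """Local branch (illustrative): non-overlapping mean pooling over
    time with the given window size. A shorter final chunk is averaged
    over the frames it actually contains."""
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        channels = len(chunk[0])
        pooled.append(
            [sum(f[c] for f in chunk) / len(chunk) for c in range(channels)]
        )
    return pooled


def statistical_pyramid_pooling(frames, windows=(2, 4)):
    """Collect global statistics plus local means at several scales.
    The window sizes are hypothetical, not taken from the thesis."""
    means, stds = global_stats(frames)
    local = {w: local_mean_pool(frames, w) for w in windows}
    return {"global_mean": means, "global_std": stds, "local": local}


if __name__ == "__main__":
    # Four frames with two channels each (toy data).
    toy = [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0], [7.0, 2.0]]
    print(statistical_pyramid_pooling(toy))
```

Smaller windows keep finer temporal detail, while the global statistics summarize the whole utterance; combining both is what gives the "multi-scale contextual prior information" its global and local components.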
Keywords/Search Tags: Anti-spoofing speaker verification, Deep learning, Pyramid pooling, Supervised contrastive learning