An automatic speaker verification (ASV) system performs identity verification based on a speaker's voice and is now widely used in everyday scenarios such as mobile phone unlocking, intelligent access control, and bank identity verification. With the application of deep learning models in recent years, ASV systems have made significant progress and demonstrated strong performance. However, they remain susceptible to spoofing attacks using synthesized or converted speech, and synthetic speech deepfake detection systems are dedicated to solving this problem. In this study, a multi-scale GMM-ResNet model is proposed for synthetic speech deepfake detection. The model consists of two main parts: multi-scale Log Gaussian Probability (LGP) feature fusion and a Multi-scale Feature Aggregation ResNet (MFA-ResNet). The main contributions are as follows:

(1) To address the correlation between Gaussian components of GMMs of different orders, this thesis proposes a synthetic speech deepfake detection method based on multi-scale LGP feature fusion. A GMM describes the distribution of speech features in feature space, and GMMs of different orders have different descriptive abilities; LGP features computed from GMMs of different orders therefore reflect the information contained in speech at different scales. Multi-scale LGP feature fusion performs a weighted combination of the LGP features obtained from GMMs of three different orders and feeds the fused features to a subsequent ResNet classifier, so that information can be exchanged across scales. The multi-scale LGP feature fusion + ResNet model achieves min t-DCF = 0.2488 and EER = 2.62% in the ASVspoof 2021 logical access scenario.

(2) To address the problem that residual blocks at different depths of a ResNet output features at different levels, this thesis proposes an MFA-ResNet model for synthetic speech deepfake detection. When training deep neural networks, the feature information obtained in early or intermediate layers is also useful for the classification task. Based on this observation, the MFA-ResNet model improves the feature extraction capability of the network by aggregating the features output by each ResNet residual block, fully fusing feature information from different layers within the network. Multi-scale LGP feature fusion and the MFA-ResNet model are integrated to obtain the multi-scale GMM-ResNet model, which further improves the effectiveness of synthetic speech deepfake detection. The multi-scale GMM-ResNet model achieves min t-DCF = 0.2442 and EER = 2.43% in the ASVspoof 2021 logical access scenario.
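To make the two ideas above concrete, the sketch below shows, in PyTorch-style pseudocode, (a) a weighted fusion of LGP feature maps computed from GMMs of different orders and (b) a small 1-D ResNet that aggregates the output of every residual block before classification. All module and parameter names (LGPFusion, MFAResNet1D, the GMM orders 256/512/1024, channel widths) are hypothetical illustrations, not the thesis implementation; it is a minimal sketch assuming each LGP input is a (batch, gmm_order, time) log-probability map.

```python
# Illustrative sketch only; names and dimensions are assumptions, not the thesis code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LGPFusion(nn.Module):
    """Weighted fusion of LGP feature maps from GMMs of different orders."""

    def __init__(self, gmm_orders=(256, 512, 1024), out_channels=512):
        super().__init__()
        # Project each scale to a common channel size so the maps can be summed.
        self.proj = nn.ModuleList(
            nn.Conv1d(order, out_channels, kernel_size=1) for order in gmm_orders
        )
        self.scale_weights = nn.Parameter(torch.ones(len(gmm_orders)))

    def forward(self, lgp_maps):
        # lgp_maps: list of (batch, gmm_order, time) tensors, one per GMM order.
        weights = torch.softmax(self.scale_weights, dim=0)
        fused = sum(w * proj(x) for w, proj, x in zip(weights, self.proj, lgp_maps))
        return fused  # (batch, out_channels, time)


class MFAResNet1D(nn.Module):
    """Toy 1-D ResNet whose per-block outputs are aggregated for classification."""

    def __init__(self, in_channels=512, width=64, num_blocks=4, num_classes=2):
        super().__init__()
        self.stem = nn.Conv1d(in_channels, width, kernel_size=3, padding=1)
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(width, width, kernel_size=3, padding=1),
                nn.BatchNorm1d(width),
                nn.ReLU(),
                nn.Conv1d(width, width, kernel_size=3, padding=1),
                nn.BatchNorm1d(width),
            )
            for _ in range(num_blocks)
        )
        self.classifier = nn.Linear(width * num_blocks, num_classes)

    def forward(self, x):
        x = self.stem(x)
        block_feats = []
        for block in self.blocks:
            x = F.relu(x + block(x))             # residual connection
            block_feats.append(x.mean(dim=-1))   # temporal average pooling per block
        aggregated = torch.cat(block_feats, dim=1)  # multi-scale feature aggregation
        return self.classifier(aggregated)


if __name__ == "__main__":
    # Dummy LGP maps from three GMMs of different orders over 200 frames.
    lgp_maps = [torch.randn(8, order, 200) for order in (256, 512, 1024)]
    fusion = LGPFusion()
    classifier = MFAResNet1D()
    scores = classifier(fusion(lgp_maps))
    print(scores.shape)  # torch.Size([8, 2])
```

In this sketch the fusion weights are learnable and normalized with a softmax, which is one possible reading of "weighted" fusion; fixed or heuristic weights would follow the same structure. Likewise, concatenating pooled per-block features is one simple way to aggregate multi-level ResNet outputs before the final classifier.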