| In recent years, speech recognition systems have been widely deployed in many fields, and audio adversarial sample attacks have increased rapidly. As the impact of these attacks grows more serious, the detection of audio adversarial samples has become an urgent problem. An audio adversarial sample attack adds perturbations that are imperceptible to the human ear to the original audio in order to tamper with the output of an automatic speech recognition (ASR) system. Most existing detection methods achieve good results by comparing the ASR transcriptions of the audio before and after a specific preprocessing step. However, these methods still suffer from poor detection of unknown whole-sentence attacks, weak generalization of the detection model, and poor detection of keyword-tampering perturbations. Accordingly, this paper proposes two audio adversarial sample detection methods, TP (Tag-Position) and FLA (Frame Level with Alexnet Adversarial Detection for Audio Samples), which target whole-sentence attack detection and keyword tampering detection, respectively. First, to address the poor detection of unknown attacks and the weak generalization of existing whole-sentence attack detection methods, this paper proposes a label position encoding method and, based on it, the detection method TP, which integrates label information together with position information into a Transformer model. The spectrogram features of the audio are extracted and divided into multiple patches, and each patch is encoded with its label position. The resulting patch feature vectors serve as the input to the encoder, and a final classification layer identifies adversarial samples to detect adversarial attacks. Experimental results show that jointly introducing label and position information improves detection performance, and that the method still detects unknown attacks well, demonstrating good robustness. Second, to improve the detection of keyword tampering in audio adversarial samples, this paper proposes a frame-level detection method, FLA. The method splits the audio into frames along the time dimension so that the high-frequency, unstable perturbations in each frame can be detected, extracts feature information from each frame using MFCC, and feeds these features to an AlexNet backbone network. AlexNet learns the features through its convolution and pooling layers, and the trained model is then used to detect keyword tampering attacks. Experimental results show that FLA better addresses the unsatisfactory detection of keyword tampering and achieves good detection performance. In summary, this paper focuses on detecting the two main types of audio adversarial sample attack, whole-sentence attacks and keyword tampering, proposes effective solutions and strategies, designs corresponding models and algorithms, and supports and validates them through experiments. It provides an effective solution for securing speech recognition systems and a valuable reference for further research on audio adversarial sample detection. |
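
The two preprocessing steps named in the abstract, dividing a spectrogram into patches (TP) and splitting audio into frames (FLA), can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `frame_signal` and `split_into_patches`, the frame length, hop size, and patch size are hypothetical, and the integer position index attached to each patch is only the patch's row-major order, not the paper's label position encoding, which the abstract does not specify.

```python
from typing import List, Sequence, Tuple


def frame_signal(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a 1-D signal into (possibly overlapping) frames along the time axis.

    frame_len and hop are illustrative parameters, not values from the paper.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(list(samples[start:start + frame_len]))
    return frames


def split_into_patches(
    spec: Sequence[Sequence[float]], patch_h: int, patch_w: int
) -> List[Tuple[int, List[List[float]]]]:
    """Cut a 2-D spectrogram (rows = frequency bins, columns = time steps)
    into non-overlapping patches, visited in row-major order.

    Returns (position_index, patch) pairs. The index is a plain running
    counter standing in for position information (an assumption).
    """
    if patch_h <= 0 or patch_w <= 0:
        raise ValueError("patch_h and patch_w must be positive")
    n_rows = len(spec)
    n_cols = len(spec[0]) if n_rows else 0
    patches = []
    index = 0
    # Rows and columns that do not fill a whole patch are dropped.
    for r in range(0, n_rows - patch_h + 1, patch_h):
        for c in range(0, n_cols - patch_w + 1, patch_w):
            patch = [list(row[c:c + patch_w]) for row in spec[r:r + patch_h]]
            patches.append((index, patch))
            index += 1
    return patches


if __name__ == "__main__":
    signal = list(range(10))
    print(len(frame_signal(signal, frame_len=4, hop=2)))

    toy_spec = [[r * 6 + c for c in range(6)] for r in range(4)]
    for position, patch in split_into_patches(toy_spec, patch_h=2, patch_w=3):
        print(position, patch)
```

In a full pipeline, each frame would then go through MFCC extraction before reaching the AlexNet classifier, and each indexed patch would be embedded before entering the Transformer encoder; those stages are not shown here.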