Deep learning depends heavily on massive training data, because large amounts of data are needed to capture the underlying patterns. However, data collection is complex and expensive, making it extremely challenging to build large-scale, high-quality annotated datasets. Collecting audio event data is particularly difficult. On the one hand, audio signals suffer interference from the environment, recording devices, and other factors, which degrades the quality and accuracy of the data; on the other hand, annotating audio event data is costly. Audio event classification is therefore usually performed with only a small amount of training data, and addressing this data scarcity is crucial for the task.

Few-shot learning studies how to build models with good generalization from only a small number of samples. This thesis proposes a few-shot learning approach based on transfer learning and data augmentation to address the shortage of training data in audio event classification. It designs multiple transfer learning schemes for few-shot audio event classification built on different base network models. Starting from the two mainstream deep learning modules, the CNN and the Transformer, the thesis improves and implements three transfer learning models for audio event classification: a CNN model based on standard convolutions, a DSCNN model based on depthwise separable convolutions, and a Transformer model based on self-attention.

The thesis also compares the effect of data-mixing augmentation methods, including Mixup, CutMix, and SpecAugment, on model generalization in audio event classification. Drawing on the characteristics and advantages of these methods, it designs a virtual sample generation method based on mixed time-frequency masking that is suited to spectrograms. To make better use of the information in audio data, and to account for both the temporal characteristics of audio signals and the input format of the transfer learning models, the first-order and second-order differences of the spectrogram are added to the input. Finally, a multi-scale data augmentation method with mixed time-frequency masking, suited to small-sample audio data, is proposed.

Experiments show that the proposed multi-scale data augmentation method with mixed time-frequency masking improves accuracy by 3 to 4 percentage points over the baseline. With this augmentation, the transfer learning scheme based on the DSCNN model reaches an accuracy of 94.6% on the small-sample ESC-50 audio event classification dataset, and the scheme based on the Transformer model reaches 99.3%, which is the best performance reported on ESC-50 to date.
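Of the data-mixing baselines compared above, Mixup is the simplest to state precisely: two training spectrograms and their one-hot labels are blended with a weight drawn from a Beta distribution. The sketch below is a minimal, generic illustration of that idea in NumPy, not the thesis's implementation; the shapes, the `alpha` value, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two spectrograms and their one-hot labels with a single
    Beta(alpha, alpha)-sampled weight, as in standard Mixup."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Toy data: two random "log-mel spectrograms" and one-hot labels
# for a 50-class task (ESC-50 has 50 classes).
xa, xb = rng.standard_normal((64, 128)), rng.standard_normal((64, 128))
ya, yb = np.eye(50)[3], np.eye(50)[17]

xm, ym = mixup(xa, ya, xb, yb)
print(xm.shape, ym.sum())  # mixed label mass stays (approximately) 1
```

The mixed label is no longer one-hot; training against it with cross-entropy is what gives Mixup its regularizing effect.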
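The two ingredients of the proposed input pipeline, stacking a spectrogram with its first- and second-order differences, and applying random time-frequency masks, can be sketched as follows. This is a hedged illustration only: the mask sizes, the edge-padding choice for the differences, and the function names (`stack_with_deltas`, `time_freq_mask`) are assumptions for demonstration, not the thesis's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

def deltas(spec):
    """First-order difference along the time axis, edge-padded so the
    output keeps the input shape. spec: (n_mels, n_frames) array."""
    d = np.diff(spec, axis=1)
    return np.pad(d, ((0, 0), (1, 0)), mode="edge")

def stack_with_deltas(spec):
    """Stack the spectrogram, its delta, and its delta-delta as three
    channels, matching the 3-channel input of image-pretrained models."""
    d1 = deltas(spec)
    d2 = deltas(d1)
    return np.stack([spec, d1, d2], axis=0)  # (3, n_mels, n_frames)

def time_freq_mask(spec, max_f=8, max_t=16, n_masks=2):
    """Zero out random frequency bands and time stripes across all
    channels (a SpecAugment-style time-frequency masking)."""
    out = spec.copy()
    n_mels, n_frames = out.shape[-2:]
    for _ in range(n_masks):
        f = rng.integers(0, max_f + 1)          # band height (may be 0)
        f0 = rng.integers(0, n_mels - f + 1)    # band start
        out[..., f0:f0 + f, :] = 0.0
        t = rng.integers(0, max_t + 1)          # stripe width (may be 0)
        t0 = rng.integers(0, n_frames - t + 1)  # stripe start
        out[..., :, t0:t0 + t] = 0.0
    return out

spec = rng.standard_normal((64, 128))  # toy log-mel spectrogram
x = time_freq_mask(stack_with_deltas(spec))
print(x.shape)  # (3, 64, 128)
```

Applying the masks after stacking keeps the three channels aligned, so a masked region is hidden from the spectrogram and both difference channels at once.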