| Audio scene recognition is a research field that has emerged in recent years. It aims to categorize scenes by their background sound. Intelligent devices can use background information extracted from the current audio to adjust system or application parameters to meet users' personalized needs. Audio scenes usually show a high degree of variability, which exists not only between different scenes but also within the same scene. As a result, audio scene recognition is arguably one of the most challenging tasks in statistical pattern recognition at present. Compared with more established areas of audio processing, such as speech recognition, the accuracy of audio scene recognition still lags far behind. To address the low classification accuracy of current audio scene recognition, this paper proposes a neural-network-based audio scene recognition method that covers audio processing, signal representation, feature extraction, and classifier design. The purpose of this study is to obtain an effective and feasible audio scene recognition system, which is evaluated on suitable audio data sets in a laboratory environment. The detailed work is as follows: (1) For the signal processing module, three data augmentation methods are used. The center and side channels are separated from the binaural stereo signal. The harmonic source and the impulsive source are separated from the mono-channel audio. Background subtraction with median filters of different sizes is applied to the generated spectrograms, and the resulting data is used to train the classifier model. (2) For the feature extraction module, Mel-frequency cepstral coefficients (MFCCs) are adopted. The frame length, frame shift, and number of filters are chosen to preserve feature quality while greatly reducing the feature dimension and the computational complexity. (3) For the classification system, after reviewing the principles and methods of neural
network classifiers, the most appropriate convolutional neural network is selected. Meanwhile, two different convolutional neural network structures are designed according to the number of input signal channels: one for single-channel input and another for binaural input. The experimental results show that these two network structures have stronger learning ability than a simple convolutional neural network. (4) For the ensemble learning module, the results of the sub-models from each of the preceding tasks are combined by an integration method, with appropriate weight parameters set to obtain the final classification result. The accuracy of the ensemble is greatly improved compared with that of the individual sub-models. Theoretical analysis and experiments show that data augmentation increases the volume of audio data and provides more samples for feature extraction and classifier training. Compared with the traditional pattern recognition method based on Gaussian mixture models (GMM), the two proposed network structures improve performance by up to 5.4%. Compared with a single classifier network, the ensemble-based classifier achieves better classification performance. |
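
Two of the augmentation steps in item (1) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the following are assumptions: the center and side channels are taken as the half-sum and half-difference of the left and right channels, and background subtraction removes a per-bin running median computed along the time axis. The function names `mid_side` and `background_subtract` and the `kernel` parameter are hypothetical and are not taken from the paper.

```python
from statistics import median


def mid_side(left, right):
    """Split a stereo signal into center (mid) and side channels.

    Assumed convention: mid = (L + R) / 2, side = (L - R) / 2.
    """
    if len(left) != len(right):
        raise ValueError("left and right channels must have equal length")
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side


def background_subtract(spec, kernel):
    """Subtract a running-median background from a magnitude spectrogram.

    spec   : list of frames, each frame a list of bin magnitudes
    kernel : odd window length in frames (hypothetical parameter)
    The median is taken per frequency bin over time; the window is
    truncated at the edges and negative results are clipped to zero.
    """
    if kernel < 1 or kernel % 2 == 0:
        raise ValueError("kernel must be a positive odd integer")
    half = kernel // 2
    n_frames = len(spec)
    out = []
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        frame = []
        for b in range(len(spec[t])):
            background = median(spec[u][b] for u in range(lo, hi))
            frame.append(max(spec[t][b] - background, 0.0))
        out.append(frame)
    return out
```

Running `background_subtract` with several `kernel` values on the same spectrogram yields several variants of each recording, which is one way to read "different median filter sizes" as a data augmentation step.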
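
The weighted integration in item (4) can be sketched as a weighted average of the class probabilities produced by each sub-model. This is one common ensemble scheme and is only an assumption here, since the abstract does not state the fusion rule or the weight values. The name `weighted_ensemble` and the example weights are hypothetical.

```python
def weighted_ensemble(prob_sets, weights):
    """Fuse per-model class probabilities with a weighted average.

    prob_sets : list of per-model probability lists (one value per class)
    weights   : one non-negative weight per model (hypothetical values)
    Returns (index of the winning class, fused probability list).
    """
    if not prob_sets or len(prob_sets) != len(weights):
        raise ValueError("need one weight per sub-model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(prob_sets[0])
    fused = [0.0] * n_classes
    for probs, w in zip(prob_sets, weights):
        if len(probs) != n_classes:
            raise ValueError("all sub-models must score the same classes")
        for c, p in enumerate(probs):
            fused[c] += (w / total) * p
    best = max(range(n_classes), key=fused.__getitem__)
    return best, fused


# Three hypothetical sub-models scoring two scene classes.
models = [[0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]
equal_choice, _ = weighted_ensemble(models, [1, 1, 1])
tuned_choice, _ = weighted_ensemble(models, [3, 1, 1])
```

With equal weights the second class wins in this toy example, while giving the first sub-model three times the weight flips the decision, which shows why the choice of weight parameters matters for the final classification result.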