Sound carries a large amount of information about our everyday environment and the physical events that take place in it. Audio scene recognition is an important audio signal processing task that aims to characterize the acoustic environment of an audio recording by classifying it into one of several common scenes, such as beach, library, or grocery store. Audio scene recognition has many potential applications, such as military surveillance and smart homes, and is therefore of far-reaching research significance.

A traditional audio scene recognition system consists of two modules: feature extraction and a classifier. Feature extraction has been dominated by hand-crafted features, which require the researcher to have a relevant professional background; because such features depend heavily on the researcher's prior knowledge and experience, good features are rare. Deep learning can learn features automatically, overcoming this shortcoming. Moreover, traditional classifiers have simple structures and cannot solve complex classification problems, which has hindered the development of audio scene recognition. Deep learning builds neural networks from multiple layers of perceptrons; under certain conditions such networks can approximate any nonlinear mapping, and they have achieved great success in image recognition, machine translation, and other fields. In audio scene recognition, deep learning can serve as the classifier, where a deeper network structure means more powerful learning ability; under supervised learning, it can also learn audio features automatically, avoiding the laborious and unstable manual selection of features. This paper therefore applies several deep learning models to the audio scene recognition problem.

First, a baseline system is implemented using MFCC features and a GMM classifier. It is a strong performer among traditional audio scene recognition methods, with an average recognition rate of 71.3%.

Next, DNN and CNN networks are applied to the task. Two DNN-based methods are constructed, using MFCC and log-mel spectrum features respectively. In both networks, ReLU replaces sigmoid as the activation function to reduce the probability of gradient saturation, and a dropout layer is added to improve the generalization ability of the network. The average recognition rates of these two methods are 70.17% and 80.27%, respectively. The log-mel DNN is then improved by introducing hierarchical classification: by analyzing the confusion matrix of its recognition results, the four scenes that are most easily confused are merged into one super-class for a first-stage classification, and a second-stage classifier then separates those confused scenes. This method reaches a recognition rate of 83.33%.

Two CNN-based methods are then implemented, for log-mel spectrum and CQT features. To prevent overfitting in both networks, batch normalization (BN) is introduced and L2 regularization is applied; the two methods achieve recognition rates of 83.4% and 82.71%, respectively. Another system builds on the log-mel CNN: the output of a middle layer of the CNN is extracted as an audio feature and fed to SVM and RF classifiers, which achieve recognition rates of 83.7% and 86.3%, respectively.

Finally, a recognition method based on feature fusion is implemented. The network consists of two similar sub-networks that differ only in the kernel size of their first convolutional layer. During training and testing, the log-mel spectrum of the audio is fed into both sub-networks simultaneously, and the outputs of the second pooling layers of the two sub-networks are concatenated to form a new feature that is input to the softmax layer. This method achieves a recognition rate of 84.59%.
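The MFCC + GMM baseline can be sketched as follows: one Gaussian mixture is fit per scene class, and a clip is classified by the model giving the highest average log-likelihood over its frames. The feature values below are random stand-ins for real MFCC frames, and the scene names are illustrative, not the full scene list used in the experiments.

```python
# MFCC + GMM baseline sketch: one mixture per scene, argmax log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
scenes = ["beach", "library", "grocery"]

# Fake "MFCC" training frames, (n_frames, n_coeffs) per scene,
# offset per scene so the mixtures are separable.
train = {s: rng.normal(loc=3.0 * i, scale=1.0, size=(200, 13))
         for i, s in enumerate(scenes)}

# Fit one GMM per scene on that scene's frames.
models = {s: GaussianMixture(n_components=4, random_state=0).fit(X)
          for s, X in train.items()}

def classify(frames: np.ndarray) -> str:
    # score() is the average per-frame log-likelihood under each scene model.
    scores = {s: m.score(frames) for s, m in models.items()}
    return max(scores, key=scores.get)

test_clip = rng.normal(loc=3.0, scale=1.0, size=(100, 13))
print(classify(test_clip))  # → library
```

In a real system the frames would come from an MFCC extractor rather than a random generator, but the per-class fit-and-score structure is the same.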
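The DNN classifier described above (ReLU in place of sigmoid, plus dropout) can be sketched in PyTorch. The layer widths and the 40-bin log-mel input dimension are illustrative assumptions, not the paper's exact configuration.

```python
# DNN sketch: fully connected layers with ReLU activations and dropout.
import torch
import torch.nn as nn

n_mels, n_classes = 40, 15  # assumed feature size and scene count

dnn = nn.Sequential(
    nn.Linear(n_mels, 256),
    nn.ReLU(),          # ReLU instead of sigmoid: less gradient saturation
    nn.Dropout(p=0.5),  # dropout layer to improve generalization
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, n_classes),  # logits; softmax is applied inside the loss
)

x = torch.randn(8, n_mels)  # a batch of 8 feature vectors
logits = dnn(x)
print(tuple(logits.shape))  # (8, 15)
```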
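The hierarchical classification idea can be illustrated with a two-stage scheme: the confusable scenes are merged into one super-class for the first classifier, and a second classifier separates only those scenes. The scene names, the "outdoor" super-class label, and the logistic-regression stand-in classifiers below are all illustrative assumptions, not the DNNs or scene groups of the actual system.

```python
# Two-stage (hierarchical) classification sketch on synthetic features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
confusable = {"park", "residential", "forest", "beach"}  # assumed group
easy = {"library", "grocery"}
labels = sorted(confusable | easy)
idx = {s: i for i, s in enumerate(labels)}

def make_data(n):
    # One noisy one-hot dimension per scene keeps the classes separable.
    y = rng.choice(labels, size=n)
    X = np.eye(len(labels))[[idx[s] for s in y]]
    return X + 0.1 * rng.normal(size=X.shape), y

X, y = make_data(120)
# Stage 1: confusable scenes collapse into a single super-class.
y_coarse = np.array(["outdoor" if s in confusable else s for s in y])
stage1 = LogisticRegression(max_iter=1000).fit(X, y_coarse)
# Stage 2: trained only on the confusable scenes.
mask = np.isin(y, list(confusable))
stage2 = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])

def predict(x):
    first = stage1.predict(x.reshape(1, -1))[0]
    return stage2.predict(x.reshape(1, -1))[0] if first == "outdoor" else first

Xt, yt = make_data(40)
acc = np.mean([predict(x) == t for x, t in zip(Xt, yt)])
print(acc)
```

The point of the design is that stage 2 only ever has to discriminate within the hard group, so it can specialize on the distinctions stage 1 cannot make.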
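The CNN methods pair batch normalization with L2 regularization against overfitting. A minimal sketch, assuming a 40x40 log-mel patch and illustrative layer widths: BN follows each convolution, and the L2 penalty enters through the optimizer's `weight_decay`.

```python
# CNN sketch with batch normalization and L2 regularization (weight decay).
import torch
import torch.nn as nn

class SceneCNN(nn.Module):
    def __init__(self, n_classes=15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),   # BN after each convolution
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * 10 * 10, n_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = SceneCNN()
# weight_decay adds the L2 penalty on the network weights.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

x = torch.randn(4, 1, 40, 40)  # 4 log-mel patches: 40 bins x 40 frames
print(tuple(model(x).shape))   # (4, 15)
```

The same architecture works unchanged for CQT input; only the number of frequency bins (and hence the linear layer's input size) differs.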
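Using a CNN's middle-layer output as an audio feature for SVM and RF classifiers can be wired up as below. The untrained toy CNN and random inputs only demonstrate the extraction-and-classification pipeline, not the reported accuracy; all shapes are assumptions.

```python
# Sketch: CNN middle-layer activations as features for SVM and random forest.
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

torch.manual_seed(0)
# Stand-in for the trained CNN truncated at a middle layer.
cnn = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

def extract(x: torch.Tensor) -> np.ndarray:
    # Flattened middle-layer feature maps serve as the audio feature vector.
    with torch.no_grad():
        return cnn(x).flatten(1).numpy()

X = extract(torch.randn(60, 1, 32, 32))   # 60 clips -> (60, 16*8*8) features
y = np.repeat(np.arange(3), 20)           # three dummy scene labels

svm = SVC().fit(X, y)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(X.shape, svm.predict(X[:1]), rf.predict(X[:1]))
```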
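Finally, the feature-fusion network can be sketched as two parallel sub-networks that differ only in the kernel size of their first convolution; the same log-mel input feeds both, and the outputs of each branch's second pooling layer are concatenated before the softmax layer. Kernel sizes 3 and 5 and all layer widths are assumptions for illustration.

```python
# Feature-fusion sketch: two branches differing only in the first kernel size.
import torch
import torch.nn as nn

def branch(first_kernel: int) -> nn.Sequential:
    pad = first_kernel // 2
    return nn.Sequential(
        nn.Conv2d(1, 8, first_kernel, padding=pad), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 2nd pool
    )

class FusionCNN(nn.Module):
    def __init__(self, n_classes=15):
        super().__init__()
        self.a, self.b = branch(3), branch(5)
        self.head = nn.Linear(2 * 16 * 10 * 10, n_classes)

    def forward(self, x):
        # Same input into both branches; concatenate their pooled outputs.
        fused = torch.cat([self.a(x).flatten(1), self.b(x).flatten(1)], dim=1)
        return self.head(fused)  # softmax is applied inside the loss

x = torch.randn(2, 1, 40, 40)
out = FusionCNN()(x)
print(tuple(out.shape))  # (2, 15)
```

The differing first-layer kernels let the two branches capture spectro-temporal patterns at two scales, which is what the concatenated feature is meant to exploit.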