Acoustic scene classification is an important research direction in the field of computer audition. The collected audio signal is first preprocessed, higher-order acoustic features are then extracted, and the scene category information contained in the signal is finally resolved through model training. The technique has broad application prospects in intelligent wearable devices, automatic noise reduction, and inspection robots. With the rapid development of artificial intelligence and the arrival of the big-data era, a variety of excellent deep learning algorithms and a large amount of audio data have further promoted the development of acoustic scene classification, which has attracted increasing attention from researchers. At present, most existing acoustic scene classification methods are based on research results from the field of speech recognition. Because acoustic scene signals have characteristics different from speech signals, the practical performance of such methods is not ideal and leaves room for improvement. To explore better acoustic scene classification methods, this thesis carries out research on the following points:

(1) To illustrate the main process of the acoustic scene classification task, the current mainstream research methods are analyzed from the aspects of acoustic features and classification models. Common time-domain features such as the short-time zero-crossing rate and short-time energy, common time-frequency features such as the short-time Fourier spectrum, log-Mel spectrogram, and constant-Q transform, and basic classification models such as the deep neural network, convolutional neural network, visual geometry group (VGG) network, and residual neural network are studied theoretically. The application range, advantages, and disadvantages of these hand-crafted acoustic features and classification models are analyzed and explained.

(2) To address the shortcomings of hand-crafted acoustic features in acoustic scene classification, a joint optimization algorithm based on non-negative matrix factorization (NMF) and a convolutional neural network is proposed on the basis of automatic feature learning. First, NMF is used to extract interpretable higher-order features from the STFT spectrum. Second, an appropriate convolutional neural network is built and pre-trained with the extracted features. Then, the feature extraction process is reversely optimized according to the training performance of the model, and the direction of feature learning is adjusted adaptively to obtain supervised discriminative features. Finally, feature optimization and model training are repeated several times to achieve joint optimization of the features and the model. Experimental results on the TUT2018 development dataset show that the classification accuracy of the joint optimization algorithm is 2.10% higher than before optimization and exceeds that of other commonly used acoustic features, which verifies the effectiveness of the algorithm.
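A minimal sketch of the NMF feature extraction step described in point (2), assuming librosa and scikit-learn are available; the function name extract_nmf_features, the STFT parameters, and the number of components are illustrative choices, not the settings used in the thesis:

```python
import numpy as np
import librosa
from sklearn.decomposition import NMF

def extract_nmf_features(audio, n_fft=2048, hop=1024, n_components=64):
    """Factorize the magnitude STFT spectrogram V ~= W @ H and return
    the activation matrix H as a higher-order feature representation."""
    # Magnitude spectrogram, shape (freq_bins, frames); non-negative as NMF requires.
    spec = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop))
    nmf = NMF(n_components=n_components, init="nndsvda", max_iter=400)
    W = nmf.fit_transform(spec)   # (freq_bins, components): spectral dictionary
    H = nmf.components_           # (components, frames): activations over time
    return W, H
```

In the joint optimization loop described above, the activation matrix would serve as the CNN input, and the factorization would then be refined according to the classification performance before the next round of training.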
(3) By studying the differences between time-frequency spectrogram images and natural images, it is found that spectrograms have distinct characteristics along the frequency axis, and an automatic feature learning method based on spectral decomposition is proposed. First, the spectrogram is decomposed into several sub-spectra along the frequency axis. Then, the sub-spectrum of each frequency band is factorized with NMF to extract higher-order features. Finally, combined with the joint optimization algorithm, a fusion model composed of independent sub-models and a global classifier is built to complete the joint training of features and model, achieving higher classification accuracy than before.

(4) To better recover global information in the sub-model fusion stage, three model fusion strategies based on the attention mechanism are proposed. By introducing a squeeze-and-excitation convolutional attention mechanism, the weights of the channel feature maps are allocated adaptively according to their importance in the fusion stage, helping the global classifier focus on the channel features that are most effective for classification. The simulation results show that the attention-based fusion strategies further improve the overall classification performance, with a highest classification accuracy of 71.77%.
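The squeeze-and-excitation channel attention used in point (4) can be sketched as a small PyTorch module; the class name and the reduction ratio of 16 are assumptions for illustration rather than details taken from the thesis:

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Channel attention: global average pooling ('squeeze') followed by a
    two-layer bottleneck ('excitation') that rescales each channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (batch, channels, freq, time)
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)          # squeeze: one statistic per channel
        w = self.fc(w).view(b, c, 1, 1)      # excitation: per-channel weight in (0, 1)
        return x * w                         # reweight the channel feature maps
```

In the fusion stage, such a block would sit between the concatenated sub-model feature maps and the global classifier, so that more informative channels are emphasized before classification.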