
Research On Classification Of Acoustic Scenes Based On Deep Learning

Posted on: 2024-04-08
Degree: Master
Type: Thesis
Country: China
Candidate: X H Shen
Full Text: PDF
GTID: 2568307127455034
Subject: Computer technology

Abstract/Summary:
The acoustic scene classification task aims to identify the environment in which the recording device is located and has strong practical applications. Early work addressed the task with traditional machine learning methods such as support vector machines and random forests. After the introduction of deep learning, computer vision advanced significantly, and acoustic scene classification followed suit: audio is now commonly represented as Mel spectrograms, and the chosen neural networks mostly follow the mainstream computer vision approach of improving accuracy by building deeper networks trained on large-scale datasets. The resulting computational cost places very high demands on storage and compute resources that portable devices cannot meet, which severely limits the deployment of acoustic scene classification on such devices. Beyond the lack of lightweight neural network models, the data collected by individual devices in a dataset differ, and the numbers of samples per device are mismatched. Past studies have shown that the visual and auditory modalities are strongly complementary and help improve scene recognition performance. However, most studies use audio as auxiliary information to boost visual tasks; how to exploit visual information in the acoustic scene classification task has so far lacked a good solution. To address these problems, this paper studies acoustic scene classification from three main aspects:

(1) To make the model deployable on portable devices and solve acoustic scene classification at low complexity, several mainstream lightweight neural networks with good reported results are selected as experimental subjects under a parameter-count constraint. The effects of attention modules on lightweight networks are studied by varying the module type and insertion position. Finally, a lightweight neural network model suited to the low-complexity acoustic scene classification task is constructed by jointly considering parameter count, training time, and accuracy.

(2) To address the mismatch in per-device sample counts, an acoustic scene classification model based on pairwise feature fusion is proposed. Pairwise feature fusion narrows the gap between the sample counts of different devices while increasing the number of samples per device, effectively alleviating the multi-device sample-mismatch problem. For what the DCASE competition defines as unseen devices, i.e., devices absent from the training set but present in the test set, the model uses average spectral information extracted from some seen devices to simulate the spectral characteristics of unseen devices, substantially improving classification accuracy on them. Evaluation on the TAU2020-Mobile dataset shows that the proposed algorithm outperforms other methods in both model size and classification accuracy, demonstrating the method's effectiveness and feasibility.

(3) To verify the benefit of adding visual information, a visual-information-assisted acoustic scene classification model is proposed. The model consists of a visual information screening module and a feature cross-fusion module. The screening module divides the audio features into several regions and weights the visual features separately for each region to measure the similarity of the acoustic scene information carried by the two feature sets. The feature cross-fusion module then uses a similarity matrix to select, at the level of the whole audio sequence, the video frames that contribute to acoustic scene classification. Evaluation on the TAU2019 dataset shows that the proposed model has clear advantages, demonstrating its effectiveness and feasibility.
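The pairwise feature fusion idea in contribution (2) can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual implementation: the feature shapes, device names, blend weight `alpha`, and the simple mean-spectrum simulation of an unseen device are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy log-Mel features per device: (n_samples, n_mels, n_frames).
# device_b is deliberately under-represented, mimicking the
# multi-device sample-mismatch problem described above.
features = {
    "device_a": rng.normal(size=(6, 40, 50)),
    "device_b": rng.normal(size=(2, 40, 50)),
}

def pairwise_fuse(x, y, alpha=0.5):
    """Blend two same-class feature maps into one new training sample."""
    return alpha * x + (1.0 - alpha) * y

# Augment the under-represented device by fusing each of its samples
# with samples from the over-represented device, which both increases
# its sample count and narrows the per-device gap.
augmented = [
    pairwise_fuse(b, features["device_a"][i % len(features["device_a"])])
    for i, b in enumerate(features["device_b"])
]

# Simulate an "unseen" device with the average spectral profile of the
# seen devices (a stand-in for the averaged spectral information the
# model extracts from visible devices).
mean_spectrum = np.mean(
    np.concatenate(list(features.values())), axis=0, keepdims=True
)
```

In practice the fused pairs would be drawn from recordings of the same acoustic scene class, and the blend weight could be sampled per pair rather than fixed; both choices are left out here for brevity.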
Keywords/Search Tags:Acoustic Scene Classification, Lightweight Convolutional Neural Network, Attention Mechanism, Pairwise Feature Fusion, Deep Learning