Research on multimodal scene classification aims to detect the scene category of a given audio-visual segment. Considering the importance of modal fusion and the ability of hidden-layer features in encoders to compress redundant information into a more discriminative representation, our team submitted a multimodal scene classification system to the Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Task 1B; this article extends that system with an encoder-group network. It systematically combines early fusion, an attention mechanism, and an encoder-group network to fuse visual and auditory features and improve classification accuracy. The main contributions are as follows: (1) Considering the insufficient information carried by a single modality and the large difference in feature distribution between the two modalities, this paper proposes fusing cross-modal information with an autoencoder group; building on this fusion, a mutual-encoder fusion is further proposed for exchanging information between the two modalities. Experiments on the Task 1B competition dataset, with accuracy as the evaluation metric, achieve 86.06% and 87.01% accuracy, respectively. Compared with the 77.02% classification accuracy of the Task 1B baseline system, this shows that mutual fusion of information between the two modalities improves scene classification performance. (2) Considering the strong performance of bimodal fusion and the attention mechanism in scene classification, this paper conducts autoencoder-group-assisted audio, mutual-encoder-assisted audio, autoencoder-assisted visual, and mutual-encoder-assisted visual experiments. The experiments use an attention mechanism to assist decision-making on top of the encoders' fused bimodal information. Comparison against the baseline system of competition Task 1B confirms the effectiveness of the proposed system, indicating that bimodal fusion of multimodal features under the attention mechanism performs well and improves classification accuracy.
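The early-fusion-plus-bottleneck idea described above can be sketched as follows. This is a minimal illustrative forward pass only, not the paper's actual network: the feature dimensions, the single-layer encoder/decoder, and the random (untrained) weights are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (not from the paper).
audio_dim, visual_dim, bottleneck = 128, 256, 64

def relu(x):
    return np.maximum(x, 0.0)

# Randomly initialised weights stand in for a trained autoencoder.
W_enc = rng.standard_normal((audio_dim + visual_dim, bottleneck)) * 0.01
W_dec = rng.standard_normal((bottleneck, audio_dim + visual_dim)) * 0.01

def fuse(audio_feat, visual_feat):
    """Early fusion: concatenate both modalities, then compress
    the joint vector through the autoencoder bottleneck."""
    x = np.concatenate([audio_feat, visual_feat], axis=-1)
    z = relu(x @ W_enc)   # fused bimodal code (compressed representation)
    x_hat = z @ W_dec     # reconstruction, used as the training target
    return z, x_hat

audio = rng.standard_normal(audio_dim)
visual = rng.standard_normal(visual_dim)
z, x_hat = fuse(audio, visual)
print(z.shape, x_hat.shape)  # (64,) (384,)
```

In a real system the reconstruction error between `x_hat` and the concatenated input would drive training, so the bottleneck code `z` is forced to keep only the information shared across or complementary between the two modalities.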
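One way to read "attention to assist decision-making" is a small head that weights each modality's prediction by a learned score. The sketch below is an assumed interpretation, not the paper's architecture: the two-branch logits, the 64-dimensional fused code, and the single linear attention head are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
num_classes = 3  # hypothetical number of scene classes

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for per-modality class scores from audio and visual branches.
audio_logits = rng.standard_normal(num_classes)
visual_logits = rng.standard_normal(num_classes)
fused_code = rng.standard_normal(64)  # bottleneck code from the encoder fusion

# A tiny attention head: the fused code yields one weight per modality.
W_attn = rng.standard_normal((64, 2)) * 0.1
alpha = softmax(fused_code @ W_attn)  # modality weights, sum to 1

# Attention-weighted decision: convex combination of branch predictions.
combined = alpha[0] * softmax(audio_logits) + alpha[1] * softmax(visual_logits)
pred = int(np.argmax(combined))
print(alpha.round(3), pred)
```

Because `alpha` sums to one, the combined output remains a valid class distribution; training the attention head end-to-end lets the system lean on whichever modality is more reliable for a given segment.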