Research on multimodal scene classification aims to detect the scene category of a given audio-visual segment. Considering the importance of modal fusion and the ability of hidden-layer features in encoders to compress redundant information into a more discriminative representation, our team submitted a multimodal scene classification system to the Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Task 1B; this article extends that system with an encoder-group network. It systematically combines early fusion, an attention mechanism, and an encoder-group network to fuse visual and auditory features and improve classification accuracy. The main contributions are as follows: (1) Considering the insufficient information carried by a single modality and the large difference in feature distribution between the two modalities, this paper proposes fusing cross-modal information with an autoencoder group; building on this fusion, a mutual-encoder fusion is further proposed for exchanging information between the two modalities. Experiments on the Task 1B competition dataset, with accuracy as the evaluation metric, achieve 86.06% and 87.01% accuracy, respectively. Compared with the 77.02% classification accuracy of the Task 1B baseline system, this shows that mutual fusion of information between the two modalities improves scene classification performance. (2) Considering the strong performance of bimodal fusion and the attention mechanism in scene classification, this paper conducts autoencoder-group-assisted audio, mutual-encoder-assisted audio, autoencoder-assisted visual, and mutual-encoder-assisted visual experiments. The experiments use an attention mechanism to assist decision-making on top of the encoders' fused bimodal information. Comparison against the baseline system of competition Task 1B confirms the effectiveness of the proposed system, indicating that bimodal fusion of multimodal features under the attention mechanism performs well and improves classification accuracy.
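The early-fusion-plus-bottleneck idea described above can be sketched as follows. This is a minimal illustrative forward pass only, not the paper's actual network: the feature dimensions, the single-layer encoder/decoder, and the random (untrained) weights are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (not from the paper).
audio_dim, visual_dim, bottleneck = 128, 256, 64

def relu(x):
    return np.maximum(x, 0.0)

# Randomly initialised weights stand in for a trained autoencoder.
W_enc = rng.standard_normal((audio_dim + visual_dim, bottleneck)) * 0.01
W_dec = rng.standard_normal((bottleneck, audio_dim + visual_dim)) * 0.01

def fuse(audio_feat, visual_feat):
    """Early fusion: concatenate both modalities, then compress
    the joint vector through the autoencoder bottleneck."""
    x = np.concatenate([audio_feat, visual_feat], axis=-1)
    z = relu(x @ W_enc)   # fused bimodal code (compressed representation)
    x_hat = z @ W_dec     # reconstruction, used as the training target
    return z, x_hat

audio = rng.standard_normal(audio_dim)
visual = rng.standard_normal(visual_dim)
z, x_hat = fuse(audio, visual)
print(z.shape, x_hat.shape)  # (64,) (384,)
```

In a real system the reconstruction error between `x_hat` and the concatenated input would drive training, so the bottleneck code `z` is forced to keep only the information shared across or complementary between the two modalities.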
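One way to read "attention to assist decision-making" is a small head that weights each modality's prediction by a learned score. The sketch below is an assumed interpretation, not the paper's architecture: the two-branch logits, the 64-dimensional fused code, and the single linear attention head are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
num_classes = 3  # hypothetical number of scene classes

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for per-modality class scores from audio and visual branches.
audio_logits = rng.standard_normal(num_classes)
visual_logits = rng.standard_normal(num_classes)
fused_code = rng.standard_normal(64)  # bottleneck code from the encoder fusion

# A tiny attention head: the fused code yields one weight per modality.
W_attn = rng.standard_normal((64, 2)) * 0.1
alpha = softmax(fused_code @ W_attn)  # modality weights, sum to 1

# Attention-weighted decision: convex combination of branch predictions.
combined = alpha[0] * softmax(audio_logits) + alpha[1] * softmax(visual_logits)
pred = int(np.argmax(combined))
print(alpha.round(3), pred)
```

Because `alpha` sums to one, the combined output remains a valid class distribution; training the attention head end-to-end lets the system lean on whichever modality is more reliable for a given segment.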