Multimodal fusion is a key research direction in artificial intelligence and multimodal machine learning. Its main goal is to build a model that processes and understands multimodal information, combining data from different modalities to achieve more robust and accurate predictions about the same event. Multimodal fusion technology has already been applied to a variety of complex and challenging tasks, including audio-visual speech recognition, audio-visual emotion recognition, and multimodal medical image analysis. Among these modalities, audio and video are the most important, so this paper focuses on fusion methods for the audio and video modalities. Fusing audio and video data, however, raises several problems, such as mismatched sampling rates and different sequence lengths. Further challenges include how to fuse multimodal information effectively, how to exploit the complementarity of different modalities, and how to handle the heterogeneity of multimodal data. To address these problems, this paper studies the key technologies of audio-visual fusion and proposes two fusion methods, which are applied to multimodal audio-visual speech recognition and emotion recognition. The main contributions are as follows:

1) For single-modal speech recognition, the recognition rate of lip reading is much lower than that of acoustic speech recognition, the speech signal is easily corrupted by noise, and the accuracy of existing visual speech recognition methods drops sharply on large-vocabulary tasks. This paper presents a multimodal audio-visual speech recognition (MAVSR) method. A two-stream front-end encoder is built on the self-attention mechanism, and a modality controller is introduced to address the uneven per-modality recognition performance caused by the dominant audio modality, improving recognition stability and robustness. A multimodal feature fusion network based on attention and convolution is constructed to handle the heterogeneity of audio and video data and to strengthen the relevance and complementarity between the two modalities. The method supports speech recognition under three settings: audio only, video only, and audio-visual fusion.

2) When performing cross-modal fusion, existing multimodal emotion recognition methods fail to fully account for feature interaction between modalities and for the loss of original semantic information. This paper proposes a cross-modal fusion method based on an interactive attention mechanism. First, an audio and video feature extraction network is constructed. The features of the two modalities are then fed into a cross-modal block, where interactive attention weights between the audio and video modalities are computed. The resulting audio-visual fusion features are combined with the original features to ensure the complementarity and completeness of the fused representation. Finally, the effectiveness of the proposed method is verified on the RAVDESS dataset, with emotion recognition results obtained under three settings: audio only, video only, and audio-visual fusion. Experimental results show that the accuracy of the audio-visual fusion method is more than 5% higher than that of the single-modality methods and clearly exceeds other mainstream methods.
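As an illustration of the idea behind keeping the audio stream from dominating the fused representation, the following minimal PyTorch sketch shows one possible form of a learned modality gate that re-weights time-aligned audio and video features. The module name, feature dimension, and the assumption of time-aligned inputs are illustrative only; this is not the paper's actual two-stream front end or modality controller.

```python
import torch
import torch.nn as nn


class GatedModalityFusion(nn.Module):
    """Illustrative gate that re-weights audio and video streams before fusion,
    so that one modality (typically audio) cannot dominate the joint feature."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Predict a per-time-step weight for each modality from both streams.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))

    def forward(self, audio_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat, video_feat: (batch, T, dim), assumed already time-aligned.
        weights = self.gate(torch.cat([audio_feat, video_feat], dim=-1))  # (batch, T, 2)
        # Convex combination of the two streams at every time step.
        return weights[..., :1] * audio_feat + weights[..., 1:] * video_feat
```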
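Similarly, the cross-modal interactive attention fusion described in the second contribution can be sketched with standard multi-head attention, where each modality queries the other and the result is combined with the original features through a residual connection. The layer sizes, the number of emotion classes, and the mean-pooling readout below are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Sketch of an interactive (cross) attention block: each modality attends
    to the other, and the fused features are added back to the original
    features so the unimodal semantics are preserved."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 8):
        super().__init__()
        # Audio queries attend to video keys/values, and vice versa.
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)  # e.g. 8 emotions in RAVDESS

    def forward(self, audio_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (batch, T_a, dim), video_feat: (batch, T_v, dim)
        a_attends_v, _ = self.audio_to_video(audio_feat, video_feat, video_feat)
        v_attends_a, _ = self.video_to_audio(video_feat, audio_feat, audio_feat)
        # Residual combination with the original features keeps the source
        # semantics alongside the cross-modal interactions.
        fused_a = self.norm_a(audio_feat + a_attends_v)
        fused_v = self.norm_v(video_feat + v_attends_a)
        # Pool over time and concatenate both streams for the final prediction.
        pooled = torch.cat([fused_a.mean(dim=1), fused_v.mean(dim=1)], dim=-1)
        return self.classifier(pooled)
```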