Multimodal fusion is a key research direction in artificial intelligence and multimodal machine learning. Its main goal is to build a model that processes and understands multimodal information, combining data from different modalities to achieve more robust and accurate predictions about the same event. Multimodal fusion technology has already been applied to a variety of complex and challenging tasks, including audio-visual speech recognition, audio-visual emotion recognition, and multimodal medical image analysis. Among these modalities, audio and video are the most important, so this paper focuses on fusion methods for the audio and video modalities. Fusing audio and video data, however, raises several problems, such as mismatched sampling rates and different sequence lengths. Further challenges include how to fuse multimodal information effectively, how to exploit the complementarity of different modalities, and how to handle the heterogeneity of multimodal data. To address these problems, this paper studies the key technologies of audio-visual fusion and proposes two fusion methods, which are applied to multimodal audio-visual speech recognition and emotion recognition. The main contributions are as follows:

1) For single-modal speech recognition, the recognition rate of lip reading is much lower than that of acoustic speech recognition, the speech signal is easily corrupted by noise, and the accuracy of existing visual speech recognition methods drops sharply on large-vocabulary tasks. This paper presents a multimodal audio-visual speech recognition (MAVSR) method. A two-stream front-end encoder is built on the self-attention mechanism, and a modality controller is introduced to address the uneven per-modality recognition performance caused by the dominant audio modality, improving recognition stability and robustness. A multimodal feature fusion network based on attention and convolution is constructed to handle the heterogeneity of audio and video data and to strengthen the relevance and complementarity between the two modalities. The method supports speech recognition under three settings: audio only, video only, and audio-visual fusion.

2) When performing cross-modal fusion, existing multimodal emotion recognition methods fail to fully account for feature interaction between modalities and for the loss of original semantic information. This paper proposes a cross-modal fusion method based on an interactive attention mechanism. First, an audio and video feature extraction network is constructed. The features of the two modalities are then fed into a cross-modal block, where interactive attention weights between the audio and video modalities are computed. The resulting audio-visual fusion features are combined with the original features to ensure the complementarity and completeness of the fused representation. Finally, the effectiveness of the proposed method is verified on the RAVDESS dataset, with emotion recognition results obtained under three settings: audio only, video only, and audio-visual fusion. Experimental results show that the accuracy of the audio-visual fusion method is more than 5% higher than that of the single-modality methods and clearly exceeds other mainstream methods.
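As an illustration of the idea behind keeping the audio stream from dominating the fused representation, the following minimal PyTorch sketch shows one possible form of a learned modality gate that re-weights time-aligned audio and video features. The module name, feature dimension, and the assumption of time-aligned inputs are illustrative only; this is not the paper's actual two-stream front end or modality controller.

```python
import torch
import torch.nn as nn


class GatedModalityFusion(nn.Module):
    """Illustrative gate that re-weights audio and video streams before fusion,
    so that one modality (typically audio) cannot dominate the joint feature."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Predict a per-time-step weight for each modality from both streams.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))

    def forward(self, audio_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat, video_feat: (batch, T, dim), assumed already time-aligned.
        weights = self.gate(torch.cat([audio_feat, video_feat], dim=-1))  # (batch, T, 2)
        # Convex combination of the two streams at every time step.
        return weights[..., :1] * audio_feat + weights[..., 1:] * video_feat
```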
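Similarly, the cross-modal interactive attention fusion described in the second contribution can be sketched with standard multi-head attention, where each modality queries the other and the result is combined with the original features through a residual connection. The layer sizes, the number of emotion classes, and the mean-pooling readout below are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Sketch of an interactive (cross) attention block: each modality attends
    to the other, and the fused features are added back to the original
    features so the unimodal semantics are preserved."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 8):
        super().__init__()
        # Audio queries attend to video keys/values, and vice versa.
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)  # e.g. 8 emotions in RAVDESS

    def forward(self, audio_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (batch, T_a, dim), video_feat: (batch, T_v, dim)
        a_attends_v, _ = self.audio_to_video(audio_feat, video_feat, video_feat)
        v_attends_a, _ = self.video_to_audio(video_feat, audio_feat, audio_feat)
        # Residual combination with the original features keeps the source
        # semantics alongside the cross-modal interactions.
        fused_a = self.norm_a(audio_feat + a_attends_v)
        fused_v = self.norm_v(video_feat + v_attends_a)
        # Pool over time and concatenate both streams for the final prediction.
        pooled = torch.cat([fused_a.mean(dim=1), fused_v.mean(dim=1)], dim=-1)
        return self.classifier(pooled)
```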