
Research On Multimodal Interactive Information Fusion Method Based On Transformer

Posted on: 2024-07-15    Degree: Master    Type: Thesis
Country: China    Candidate: K Jiang    Full Text: PDF
GTID: 2568307094959159    Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet and information technology, the volume and variety of multimodal data that people need to handle, including multimedia, text, and video, are steadily increasing. Conventional single-modal data can no longer meet the demands of the information era, and the emergence of massive amounts of multimodal data raises the challenge of multimodal data fusion: how to efficiently process and fuse data of diverse structures and types. Research on multimodal data fusion is of great significance, as it helps us understand the relationships among different modalities and improves the quality and reliability of the information. In recent years, a growing number of researchers have applied deep learning techniques to multimodal data fusion and achieved notable results. Building on these prior studies, this thesis further investigates deep-learning-based multimodal data fusion, focusing on the following aspects:

(1) The extraction of multimodal data features forms the basis of multimodal data fusion, and the quality of these features is a key factor in fusion performance. This thesis therefore begins with multimodal feature extraction and proposes a technique for extracting features from textual, audio, and visual data. For text, a pre-trained BERT model is used to extract features, while for audio and visual data a BiLSTM is employed. The results demonstrate that the proposed approach effectively extracts high-quality multimodal features. Furthermore, this thesis visualizes the feature distributions of the different modalities to explore the distribution of each modality, which lays a foundation for the subsequent research.

(2) To address the problem that existing methods do not fully exploit the interaction information between multimodal data, this thesis proposes a multimodal data fusion model based on multiple attention mechanisms. The model first employs LSTMs to transform the multimodal features and map the features of different modalities into the same subspace. Three attention modules (a unimodal attention module, a bimodal attention module, and a multimodal attention module) are then used to extract the interaction information between the modalities. Finally, a soft attention mechanism fuses the interaction information and filters out redundant information, thereby reducing noise. Experimental results demonstrate the validity of the proposed model, which effectively extracts the interaction information between different modalities.

(3) To address the poor robustness of existing multimodal data fusion models, this thesis proposes a novel multimodal data fusion model based on adaptive coding. The model introduces dynamic weights for the features of the different modalities and maps them into the same subspace using LSTMs to enhance robustness. In addition, a Transformer-based encoder extracts the consistency and complementarity features of the multimodal data to explore the interaction information between the modalities. Finally, the model's resilience to interference is improved by quantifying the uncertainty of the predicted distribution with an improved loss function. Experimental results demonstrate that the proposed model achieves superior performance and robustness.
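To make point (1) concrete, the following is a minimal sketch of per-modality encoders in PyTorch: a pre-trained BERT for text and a BiLSTM for frame-level audio or visual descriptors. The checkpoint name ("bert-base-uncased"), hidden sizes, and pooling choices are illustrative assumptions and are not taken from the thesis.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class TextEncoder(nn.Module):
    """Sentence-level text features from a pre-trained BERT (assumed checkpoint)."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.bert = BertModel.from_pretrained(model_name)

    def forward(self, sentences):
        # sentences: list of raw strings
        batch = self.tokenizer(sentences, padding=True, truncation=True,
                               return_tensors="pt")
        out = self.bert(**batch)
        return out.last_hidden_state[:, 0]          # [CLS] embedding per sentence


class SequenceEncoder(nn.Module):
    """BiLSTM over frame-level audio or visual descriptors."""

    def __init__(self, in_dim, hid_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hid_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, frames):
        # frames: (batch, time, in_dim) sequence of per-frame features
        _, (h_n, _) = self.bilstm(frames)
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # concat final fwd/bwd states
```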
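The next sketch illustrates the kind of attention-based fusion described in point (2): a cross-modal (bimodal) attention module that extracts interaction information between two modalities, followed by a soft attention layer that weights and fuses the resulting vectors while damping redundant ones. The module layout, head count, and dimensions are assumptions for illustration; the thesis's exact unimodal, bimodal, and multimodal modules may differ.

```python
import torch
import torch.nn as nn


class BimodalAttention(nn.Module):
    """Cross-attention from modality a (query) onto modality b (key/value)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, b):
        # a, b: (batch, seq, dim) features already mapped into a shared subspace
        fused, _ = self.attn(query=a, key=b, value=b)
        return fused.mean(dim=1)                    # pooled interaction vector


class SoftAttentionFusion(nn.Module):
    """Soft attention over interaction vectors: score, weight, and fuse them."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, vectors):
        # vectors: list of (batch, dim) interaction vectors from the attention modules
        stacked = torch.stack(vectors, dim=1)       # (batch, n, dim)
        weights = torch.softmax(self.score(stacked), dim=1)
        return (weights * stacked).sum(dim=1)       # fused (batch, dim) representation
```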
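For point (3), the thesis's improved loss function is not specified in this abstract. The sketch below shows one standard way to quantify the uncertainty of a predicted distribution in a regression head, a heteroscedastic Gaussian negative log-likelihood in which the model predicts both a mean and a log-variance; it is a stand-in for illustration, not the thesis's actual loss.

```python
import torch
import torch.nn as nn


class UncertaintyHead(nn.Module):
    """Predicts a mean and a log-variance from the fused multimodal representation."""

    def __init__(self, dim):
        super().__init__()
        self.mean = nn.Linear(dim, 1)
        self.log_var = nn.Linear(dim, 1)

    def forward(self, fused):
        # fused: (batch, dim) fused multimodal feature
        return self.mean(fused), self.log_var(fused)


def heteroscedastic_loss(mean, log_var, target):
    # Samples the model is uncertain about (large log_var) contribute less to the
    # squared error, while the log_var term keeps the variance from growing unbounded.
    precision = torch.exp(-log_var)
    return (precision * (target - mean) ** 2 + log_var).mean()
```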
Keywords/Search Tags:Multimodal data fusion, Multimodal Interactive Information, Transformer, Attention Mechanism, Uncertainty Estimation