
Research On Multimodal Data Fusion Method Based On Deep Learning And D-S Evidence Theory

Posted on: 2024-01-17 | Degree: Master | Type: Thesis
Country: China | Candidate: Y T Geng | Full Text: PDF
GTID: 2568307097957039 | Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the rapid development of information technology, multimodal data such as text, audio, image, and video are being generated and accumulated in various fields at an unprecedented speed. Since multimodal data contain a wealth of information, how to extract the information most critical for event or topic analysis, how to exploit the correlations between different modalities, and how to correctly measure and handle the conflicting information between different modal data have become the major difficulties in multimodal data fusion. In view of these difficulties, this thesis takes audio data and image data as its research objects and carries out the following work:

(1) In order to extract deep features with strong discriminative ability from redundant multimodal data, this thesis constructs a Stack Self-attention audio deep feature extraction network based on the attention mechanism, and an image deep feature extraction network based on Cutout and CBAM. In the audio network, the audio is first preprocessed with MFCC to obtain features better matched to the characteristics of human hearing; the Stack Self-attention network then extracts higher-dimensional features from the audio, with the self-attention mechanism enhancing useful features and suppressing useless ones to improve recognition accuracy. Experiments show that the two-layer Stack Self-attention model performs best, achieving an average accuracy of 92.3% and an F1-score of 0.9236 on audio recognition. In the image network, the Cutout data augmentation strategy increases the diversity of the training data and improves the generalization ability of the model, while the CBAM module introduced into the ResNet50 backbone strengthens attention to spatial positions and feature channels, improving recognition performance. Experiments show that the ResNet50 network with Cutout and CBAM performs best, achieving an average accuracy of 92.43% and an F1-score of 0.9241 on image recognition.

(2) In order to exploit the correlations between modalities and correctly measure the conflicting information between modal data, this thesis introduces the confidence Hellinger distance and Shannon entropy to apply a weighted preprocessing to the BPA functions, reducing the weight of unreliable BPA functions and weakening their influence on the fusion result. First, the total probability formula is used to construct the BPA functions required for D-S evidence theory fusion; the confidence Hellinger distance then measures the degree of conflict between BPA functions, and Shannon entropy quantifies the amount of information each carries. Next, a weight factor built from the conflict degree and the information content is used to weight the BPA functions. Finally, the weighted BPA functions are fused by the Dempster combination rule of D-S evidence theory. Applying this weighted-BPA D-S fusion method to decision-level fusion of the local decision results obtained from the audio and image features of research content (1) gives an average recognition accuracy of 95.06% with a standard deviation of 0.0763; compared with the unimproved D-S evidence theory method, the average accuracy increases by 1.76% and the standard deviation decreases by 0.0019.
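As an illustration of the audio branch in (1), the sketch below pairs MFCC preprocessing with a two-layer stack of self-attention blocks. It is a minimal sketch, not the thesis's actual architecture: the embedding size, head count, class count, and mean-pooling readout are assumptions for illustration.

```python
# Minimal sketch of an MFCC + stacked self-attention audio classifier.
# Hyperparameters (dim, heads, n_classes) are illustrative assumptions.
import librosa
import torch
import torch.nn as nn

def extract_mfcc(path, n_mfcc=40):
    """Load an audio file and return its MFCC sequence (time x n_mfcc)."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return torch.from_numpy(mfcc.T).float()

class SelfAttentionBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Self-attention re-weights frames: informative frames are enhanced,
        # uninformative ones suppressed; a residual connection preserves input.
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)

class StackSelfAttentionNet(nn.Module):
    def __init__(self, n_mfcc=40, dim=128, n_layers=2, n_classes=10):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, dim)
        self.blocks = nn.Sequential(
            *[SelfAttentionBlock(dim) for _ in range(n_layers)])
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                 # x: (batch, time, n_mfcc)
        h = self.blocks(self.proj(x))
        return self.head(h.mean(dim=1))   # pool over time, then classify
```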
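The image branch in (1) combines Cutout augmentation with CBAM attention on a ResNet50 backbone. Below is a minimal sketch of the two building blocks; the patch size, reduction ratio, and kernel size are common published defaults, not values taken from the thesis. In practice the CBAM modules would be inserted after the residual stages of a ResNet50 (e.g., torchvision's resnet50), which is omitted here for brevity.

```python
# Illustrative sketches of Cutout and CBAM; defaults are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Cutout:
    """Zero out a random square patch to diversify training images."""
    def __init__(self, size=16):
        self.size = size

    def __call__(self, img):              # img: (C, H, W) tensor
        _, h, w = img.shape
        y = torch.randint(h, (1,)).item()
        x = torch.randint(w, (1,)).item()
        y1, y2 = max(0, y - self.size // 2), min(h, y + self.size // 2)
        x1, x2 = max(0, x - self.size // 2), min(w, x + self.size // 2)
        img = img.clone()
        img[:, y1:y2, x1:x2] = 0.0
        return img

class CBAM(nn.Module):
    """Channel attention followed by spatial attention (Woo et al., 2018)."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                 # x: (B, C, H, W)
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over avg- and max-pooled descriptors.
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).view(b, c))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).view(b, c))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: convolve stacked channel-wise avg and max maps.
        s = torch.cat([x.mean(1, keepdim=True),
                       x.max(1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.conv(s))
```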
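The fusion step in (2) can be made concrete with a short numeric sketch: pairwise Hellinger distances yield a credibility score, Shannon entropy yields an information score, the two combine into weights, and the weighted-average BPA is fused with Dempster's rule. The exact weight construction is an assumption (the thesis may build the factor differently), the example BPAs are hypothetical, and the frame is restricted to singleton hypotheses, matching decision-level fusion of classifier outputs. Combining the averaged BPA with itself n-1 times follows the common Murphy-style weighted-averaging convention.

```python
# Sketch of weighted-BPA D-S fusion over singleton hypotheses (assumptions noted above).
import numpy as np

def hellinger(m1, m2):
    """Hellinger distance between two BPAs over the same singleton frame."""
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(m1 * m2))))

def shannon_entropy(m):
    """Shannon entropy of a BPA, quantifying its information content."""
    m = m[m > 0]
    return -np.sum(m * np.log(m))

def weighted_bpa(bpas):
    """Down-weight conflicting, low-information BPAs, then average."""
    n = len(bpas)
    cred = np.array([1.0 - np.mean([hellinger(bpas[i], bpas[j])
                                    for j in range(n) if j != i])
                     for i in range(n)])
    # Lower entropy -> more information -> larger weight (one plausible choice).
    info = np.exp(-np.array([shannon_entropy(m) for m in bpas]))
    w = cred * info
    w = w / w.sum()
    return np.sum([wi * mi for wi, mi in zip(w, bpas)], axis=0)

def dempster(m1, m2):
    """Dempster's combination rule restricted to singleton hypotheses."""
    joint = m1 * m2                  # mass where both BPAs agree on a singleton
    k = 1.0 - joint.sum()            # conflict mass from all disjoint pairs
    return joint / (1.0 - k)         # normalize by the non-conflicting mass

# Hypothetical local decisions (BPAs) from the audio and image branches.
audio_bpa = np.array([0.7, 0.2, 0.1])
image_bpa = np.array([0.6, 0.3, 0.1])
m_avg = weighted_bpa([audio_bpa, image_bpa])
fused = dempster(m_avg, m_avg)       # combine the average n-1 = 1 time
print(fused)
```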
Keywords/Search Tags: Multimodal Data Fusion, Attention Mechanism, D-S Evidence Theory, Conflicting BPA Function Fusion