
Multimodal Representation Learning Method And Its Applications

Posted on: 2024-08-11    Degree: Master    Type: Thesis
Country: China    Candidate: Z W Ding    Full Text: PDF
GTID: 2568306932954919    Subject: Data Science (Mathematics)
Abstract/Summary:
With the proliferation of information media, the types of modalities that can be collected have become increasingly rich, such as facial-expression photos, medical texts, and interview recordings. Extracting the key information in such multimodal data for predictive tasks is worth studying. The purpose of multimodal representation learning is to integrate multimodal information to improve a model's predictive performance, and it is widely used in machine learning tasks. However, existing multimodal representation learning methods face the following problems when applied to infectious disease prediction, electronic medical record analysis, and sentiment analysis: (1) Most infectious disease models ignore the impact of population movement on virus transmission, and they can only predict the number of confirmed cases but fail to estimate the number of infected cases. (2) Previous studies have used pairwise modal interaction to capture the long-distance dependence between modalities; for example, in interview videos, respondents may convey negative emotions by combining facial expressions with positive words spoken at another moment. However, such models suffer from high complexity and poor robustness when they rely on pairwise interaction to associate information. (3) Under incomplete-modality conditions, data that are randomly missing during training lead to semantic sparsity. Most models attend to the missing semantic information through cycle-consistency and reconstruction losses, but they do not make full use of the complementary information across modalities to solve this problem.

To address these problems, this thesis adopts the following methods, respectively: multimodal feature concatenation to predict the numbers of confirmed and infected cases, an attention mechanism designed to capture the long-distance dependence between modalities, and modal interaction combined with a denoising-autoencoder structure to attend to the semantic information of missing data. The details are as follows:

(1) A multimodal representation model based on feature concatenation, BPISI-LSTM, is proposed. Community mobility data and daily case counts are used as model inputs. The model integrates the back-projection algorithm with a long short-term memory (LSTM) neural network, so it can not only predict the number of confirmed cases but also estimate the number of infected cases. Compared with either data-driven models or dynamic models, the proposed model provides more accurate short-term and long-term predictions because it combines the transmission mechanism with multimodal data.

(2) A star-graph-based interaction representation model, SGIR, is proposed. The model constructs a star-graph representation of the modalities and then captures their long-distance dependence through an attention mechanism defined on the star graph. Experimental results on sentiment analysis and electronic medical record datasets show the superiority of SGIR. The number of modal interactions grows linearly with the number of modalities, which is computationally efficient compared with pairwise-interaction models, and the indirect modal interactions make SGIR robust to noisy modalities.
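To make the star-graph interaction concrete, the minimal PyTorch sketch below (written for this summary, not the SGIR implementation; the embedding size, number of attention heads, and the learnable hub node are assumptions) routes all cross-modal exchange through a single star node: the hub first attends over the modality nodes, and each modality then attends back to the hub, so the number of interactions grows linearly with the number of modalities instead of quadratically as in pairwise schemes.

import torch
import torch.nn as nn

class StarGraphInteraction(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.hub = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # learnable star node
        self.gather = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, modalities: torch.Tensor) -> torch.Tensor:
        # modalities: (batch, M, dim), one row per modality (e.g. text, audio, vision)
        hub = self.hub.expand(modalities.size(0), -1, -1)
        # gather step: the star node attends over all M modality nodes
        hub, _ = self.gather(hub, modalities, modalities)
        # scatter step: each modality attends back to the star node only,
        # so information flows between modalities indirectly via the hub
        updated, _ = self.scatter(modalities, hub, hub)
        return updated

x = torch.rand(4, 3, 128)                   # batch of 4 samples, 3 modalities
print(StarGraphInteraction()(x).shape)      # torch.Size([4, 3, 128])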
(3) A star-graph-based encoder and reconstruction network, SGER-Net, is proposed. For multimodal inputs with randomly missing modalities, the model first builds the star-graph representation and performs modal interaction during training, then performs a weighted fusion of the features before and after interaction, and finally performs mask reconstruction and feature reconstruction; a minimal sketch of these two objectives is given after the summary below. Experimental results on a sentiment analysis dataset show that SGER-Net not only inherits SGIR's ability to capture complementary information between and within modalities, but also attends to the semantic information of missing data under incomplete modalities.

In summary, focusing on multimodal representation learning, we explore the complementary information of modalities in three application scenarios using feature concatenation, modal interaction, and modal reconstruction, respectively, so as to improve the prediction accuracy, computational efficiency, and robustness of the model.
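As a closing illustration of the mask-reconstruction and feature-reconstruction objectives described for SGER-Net, the self-contained sketch below (assumptions throughout: a plain linear layer stands in for the star-graph encoder, and the two losses are weighted equally) zeroes out randomly missing modalities and trains two heads to recover the dropped features and the presence mask.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedReconstruction(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)    # stand-in for the star-graph encoder
        self.feat_head = nn.Linear(dim, dim)  # feature reconstruction head
        self.mask_head = nn.Linear(dim, 1)    # mask (presence) reconstruction head

    def forward(self, feats: torch.Tensor, present: torch.Tensor) -> torch.Tensor:
        # feats: (batch, M, dim); present: (batch, M), 1.0 where a modality is observed
        x = feats * present.unsqueeze(-1)      # zero out randomly missing modalities
        h = torch.relu(self.encoder(x))
        # feature reconstruction: recover all modality features, including dropped ones
        feat_loss = F.mse_loss(self.feat_head(h), feats)
        # mask reconstruction: predict which modalities were observed vs. dropped
        mask_loss = F.binary_cross_entropy_with_logits(
            self.mask_head(h).squeeze(-1), present)
        return feat_loss + mask_loss           # equal weighting (assumed)

feats = torch.rand(4, 3, 128)
present = (torch.rand(4, 3) > 0.3).float()    # simulate random modality dropout
print(MaskedReconstruction()(feats, present))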
Keywords/Search Tags:Multimodal learning, Representation learning, Deep learning, Infectious disease prediction, Sentiment analysis, Electronic medical record analysis