
Research And Implementation Of Multimodal Representation Learning By Image-Text Fusion

Posted on: 2024-03-10    Degree: Master    Type: Thesis
Country: China    Candidate: G L Wang    Full Text: PDF
GTID: 2568306944962699    Subject: Computer Science and Technology
Abstract/Summary:
Multimodal representation learning of image and text exploits the correlation and complementarity between modalities, and combines techniques such as feature fusion and alignment to learn joint or coordinated representations for specific tasks, yielding more comprehensive semantic features. However, image-text multimodal representation learning still faces issues such as the difficulty of measuring semantic similarity between modalities, coarse alignment granularity, and feature disparity. Existing methods for these issues fall mainly into two categories: the first incorporates information such as the relative positions of image regions and the syntactic structure of sentences into feature vectors, but cannot effectively represent semantics such as attributes and relationships; the second uses intermodal fusion to achieve fine-grained alignment between image regions and words in text, but lacks interaction between the global information of the two modalities. To address these issues, this paper proposes a multimodal network that combines scene graphs and context fusion, which enhances node features with scene graphs and global feature vectors to resolve ambiguous correspondences. In addition, the paper proposes a hybrid fusion network based on a multimodal scene graph to improve the efficiency of modal fusion. The work presented in this paper consists of three main parts:

(1) A multimodal network that integrates scene graphs and context information. The paper uses scene graphs to extract semantic information such as objects, attributes, and relationships from the image and the text, and proposes a hierarchical attention fusion algorithm based on scene graphs to incorporate this semantic information into single-modal feature vectors. The overall feature information of each modality is then modeled with global feature vectors, from which context vectors are generated to guide the cross-modal local fusion of image and text (a minimal sketch of such context-guided fusion follows the abstract).

(2) A method for constructing a multimodal scene graph and its corresponding hybrid fusion network. The paper builds the multimodal scene graph from the single-modal scene graphs using rules, and then applies a Graph Transformer to the multimodal scene graph to achieve fine-grained hybrid fusion within and between modalities (see the second sketch below).

(3) Experimental validation of the feasibility and effectiveness of the proposed model and algorithms. The paper compares the proposed model with models such as MMCA and GraDual, achieving performance improvements of 0.5%-17.9%, 1.5%-9%, 0.4%-4.4%, and 0.2%-19.5% on the R@1, R@5, R@10, and rSum metrics for the downstream task of image-text cross-modal retrieval (the metric definitions are sketched below). The effectiveness of the proposed model is further validated through experiments such as ablation studies and parameter sensitivity analyses.
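As an illustration of the context-guided fusion described in part (1), the following PyTorch sketch shows a cross-modal attention layer in which a global context vector pooled from the sentence biases the attention of image regions over words. The class and parameter names (ContextGuidedFusion, regions, words, dim) are hypothetical and not taken from the thesis; the actual hierarchical attention fusion algorithm may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ContextGuidedFusion(nn.Module):
        # Hypothetical layer: image-region features attend to word features, with the
        # attention queries biased by a global context vector pooled from the sentence.
        def __init__(self, dim):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim)
            self.k_proj = nn.Linear(dim, dim)
            self.ctx_proj = nn.Linear(dim, dim)

        def forward(self, regions, words):
            # regions: (B, R, D) scene-graph-enhanced image-region features
            # words:   (B, W, D) scene-graph-enhanced word features
            ctx = self.ctx_proj(words.mean(dim=1, keepdim=True))        # (B, 1, D) global text context
            q = self.q_proj(regions) + ctx                               # context-biased queries
            k = self.k_proj(words)
            attn = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)  # (B, R, W)
            return regions + attn @ words                                # residual cross-modal fusion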
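For part (2), the hybrid fusion over a multimodal scene graph can be sketched as masked self-attention over the merged set of image and text nodes, where the graph's adjacency matrix restricts which nodes may attend to each other. This is an assumed simplification (MaskedGraphAttention is a hypothetical name); the thesis's Graph Transformer is not specified in the abstract and may differ.

    import torch
    import torch.nn as nn

    class MaskedGraphAttention(nn.Module):
        # Assumed Graph-Transformer-style layer: image and text scene-graph nodes form one
        # node set, and attention is restricted to edges of the merged multimodal graph.
        def __init__(self, dim, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, nodes, adj):
            # nodes: (B, N, D) concatenated image and text nodes
            # adj:   (B, N, N) boolean adjacency of the multimodal scene graph
            adj = adj | torch.eye(adj.size(-1), dtype=torch.bool, device=adj.device)  # keep self-loops
            mask = (~adj).repeat_interleave(self.attn.num_heads, dim=0)  # True = attention blocked
            h, _ = self.attn(nodes, nodes, nodes, attn_mask=mask)
            nodes = self.norm1(nodes + h)
            return self.norm2(nodes + self.ffn(nodes))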
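The metrics in part (3) are the standard retrieval recalls: R@K is the percentage of queries whose ground-truth match appears among the top-K retrieved candidates, and rSum is the sum of R@1, R@5, and R@10 over both retrieval directions (image-to-text and text-to-image). The sketch below assumes the i-th image and the i-th text form a matched pair.

    import numpy as np

    def recall_at_k(sim, ks=(1, 5, 10)):
        # sim[i, j] = similarity between query i and candidate j; the ground-truth
        # match for query i is assumed to be candidate i (diagonal pairing).
        order = np.argsort(-sim, axis=1)                      # candidates, best first
        ranks = np.array([np.where(order[i] == i)[0][0] for i in range(sim.shape[0])])
        return {f"R@{k}": 100.0 * np.mean(ranks < k) for k in ks}

    def rsum(sim):
        # rSum = sum of R@1, R@5, R@10 for image-to-text and text-to-image retrieval.
        return sum(recall_at_k(sim).values()) + sum(recall_at_k(sim.T).values())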
Keywords/Search Tags:Multimodal Representation Learning, Scene Graph, Hierarchical Attention Fusion, Global Feature Vector, Image-Text Cross-modal Retrieval