
Research And Implementation Of Multimodal Representation Learning By Image-Text Fusion

Posted on: 2024-03-10    Degree: Master    Type: Thesis
Country: China    Candidate: G L Wang    Full Text: PDF
GTID: 2568306944962699    Subject: Computer Science and Technology
Abstract/Summary:
Multimodal representation learning of image and text exploits the correlation and complementarity between modalities, and combines techniques such as feature fusion and alignment to learn joint or coordinated representations for specific tasks, yielding more comprehensive semantic features. However, image-text multimodal representation learning still faces issues such as the difficulty of measuring semantic similarity between modalities, coarse alignment granularity, and feature disparity. Existing methods for these issues fall mainly into two categories: the first incorporates information such as the relative positions of image regions and the syntactic structure of sentences into feature vectors, but cannot effectively represent semantics such as attributes and relationships; the second uses intermodal fusion to achieve fine-grained alignment between image regions and words in text, but lacks interaction between the global information of the two modalities. To address these issues, this paper proposes a multimodal network that combines scene graphs and context fusion, which enhances node features with scene graphs and global feature vectors to resolve ambiguous correspondences. In addition, the paper proposes a hybrid fusion network based on a multimodal scene graph to improve the efficiency of modal fusion. The work presented in this paper consists of three main parts:

(1) A multimodal network that integrates scene graphs and context information. The paper uses scene graphs to extract semantic information such as objects, attributes, and relationships from the image and the text, and proposes a hierarchical attention fusion algorithm based on scene graphs to incorporate this semantic information into single-modal feature vectors. The overall feature information of each modality is then modeled with global feature vectors, from which context vectors are generated to guide the cross-modal local fusion of image and text (a minimal sketch of such context-guided fusion follows the abstract).

(2) A method for constructing a multimodal scene graph and its corresponding hybrid fusion network. The paper builds the multimodal scene graph from the single-modal scene graphs using rules, and then applies a Graph Transformer to the multimodal scene graph to achieve fine-grained hybrid fusion within and between modalities (see the second sketch below).

(3) Experimental validation of the feasibility and effectiveness of the proposed model and algorithms. The paper compares the proposed model with models such as MMCA and GraDual, achieving performance improvements of 0.5%-17.9%, 1.5%-9%, 0.4%-4.4%, and 0.2%-19.5% on the R@1, R@5, R@10, and rSum metrics for the downstream task of image-text cross-modal retrieval (the metric definitions are sketched below). The effectiveness of the proposed model is further validated through experiments such as ablation studies and parameter sensitivity analyses.
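As an illustration of the context-guided fusion described in part (1), the following PyTorch sketch shows a cross-modal attention layer in which a global context vector pooled from the sentence biases the attention of image regions over words. The class and parameter names (ContextGuidedFusion, regions, words, dim) are hypothetical and not taken from the thesis; the actual hierarchical attention fusion algorithm may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ContextGuidedFusion(nn.Module):
        # Hypothetical layer: image-region features attend to word features, with the
        # attention queries biased by a global context vector pooled from the sentence.
        def __init__(self, dim):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim)
            self.k_proj = nn.Linear(dim, dim)
            self.ctx_proj = nn.Linear(dim, dim)

        def forward(self, regions, words):
            # regions: (B, R, D) scene-graph-enhanced image-region features
            # words:   (B, W, D) scene-graph-enhanced word features
            ctx = self.ctx_proj(words.mean(dim=1, keepdim=True))        # (B, 1, D) global text context
            q = self.q_proj(regions) + ctx                               # context-biased queries
            k = self.k_proj(words)
            attn = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)  # (B, R, W)
            return regions + attn @ words                                # residual cross-modal fusion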
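For part (2), the hybrid fusion over a multimodal scene graph can be sketched as masked self-attention over the merged set of image and text nodes, where the graph's adjacency matrix restricts which nodes may attend to each other. This is an assumed simplification (MaskedGraphAttention is a hypothetical name); the thesis's Graph Transformer is not specified in the abstract and may differ.

    import torch
    import torch.nn as nn

    class MaskedGraphAttention(nn.Module):
        # Assumed Graph-Transformer-style layer: image and text scene-graph nodes form one
        # node set, and attention is restricted to edges of the merged multimodal graph.
        def __init__(self, dim, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, nodes, adj):
            # nodes: (B, N, D) concatenated image and text nodes
            # adj:   (B, N, N) boolean adjacency of the multimodal scene graph
            adj = adj | torch.eye(adj.size(-1), dtype=torch.bool, device=adj.device)  # keep self-loops
            mask = (~adj).repeat_interleave(self.attn.num_heads, dim=0)  # True = attention blocked
            h, _ = self.attn(nodes, nodes, nodes, attn_mask=mask)
            nodes = self.norm1(nodes + h)
            return self.norm2(nodes + self.ffn(nodes))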
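The metrics in part (3) are the standard retrieval recalls: R@K is the percentage of queries whose ground-truth match appears among the top-K retrieved candidates, and rSum is the sum of R@1, R@5, and R@10 over both retrieval directions (image-to-text and text-to-image). The sketch below assumes the i-th image and the i-th text form a matched pair.

    import numpy as np

    def recall_at_k(sim, ks=(1, 5, 10)):
        # sim[i, j] = similarity between query i and candidate j; the ground-truth
        # match for query i is assumed to be candidate i (diagonal pairing).
        order = np.argsort(-sim, axis=1)                      # candidates, best first
        ranks = np.array([np.where(order[i] == i)[0][0] for i in range(sim.shape[0])])
        return {f"R@{k}": 100.0 * np.mean(ranks < k) for k in ks}

    def rsum(sim):
        # rSum = sum of R@1, R@5, R@10 for image-to-text and text-to-image retrieval.
        return sum(recall_at_k(sim).values()) + sum(recall_at_k(sim.T).values())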
Keywords/Search Tags:Multimodal Representation Learning, Scene Graph, Hierarchical Attention Fusion, Global Feature Vector, Image-Text Cross-modal Retrieval