Cross-modal recipe retrieval has attracted widespread attention in recent years, driven by various food-related applications and increasing concern about health. The task is addressed by combining multi-modal data (e.g., images and texts) and is of broad significance for bridging vision and language. Early work focuses on learning joint representations by projecting food images and recipe texts (e.g., ingredients and instructions) into the same embedding space and proposing different cross-modal fusion structures. Recently, most methods adopt a pre-training and fine-tuning strategy to capture the alignment between modalities. While these methods offer appreciable retrieval performance, they still suffer from three limitations: 1) as the complexity of the pre-trained model increases, the data requirements and the computational cost of the fine-tuning stage also rise; 2) the downstream fine-tuning tasks designed for cross-modal recipe retrieval have a gap with the pre-trained model; and 3) the underlying differences between data within the same modality are neglected, and the trilinear interaction among the three inputs is only implicitly captured. To this end, we propose a novel fusion framework named Trilinear FUsion Network (TFUN), which exploits high-level associations among the three inputs simultaneously and explicitly learns an accurate cross-modal similarity function via a bi-directional triplet loss; the framework is generic for the recipe retrieval task. To reduce model complexity, we introduce tensor decomposition to ensure computational efficiency and accessibility, and we develop a three-stage hard triplet sampling scheme to ensure fast convergence. We also propose a Prompt Based Learning Framework (PBLF), which, for the first time, adapts the transferable vision-language model CLIP (Contrastive Language-Image Pre-training) to the recipe retrieval task and designs an appropriate prompt to train the model efficiently, bridging the gap between the pre-trained model and the downstream task and transferring the knowledge of CLIP to the specific recipe retrieval task. Extensive experiments on the large-scale cross-modal recipe dataset Recipe1M demonstrate the superiority of our proposed models compared with state-of-the-art approaches.
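For concreteness, a minimal sketch of a standard bi-directional triplet objective of the kind referred to above, stated under generic assumptions; the symbols (image embedding $v$, recipe embedding $r$, similarity $s(\cdot,\cdot)$, margin $\alpha$) are illustrative and not taken from the paper's own notation:
\[
\mathcal{L}_{\mathrm{tri}} = \sum \Big[\, \big[\alpha - s(v, r^{+}) + s(v, r^{-})\big]_{+} + \big[\alpha - s(r, v^{+}) + s(r, v^{-})\big]_{+} \Big],
\]
where $[x]_{+} = \max(0, x)$, $r^{+}/r^{-}$ denote a matching/non-matching recipe for image $v$, and $v^{+}/v^{-}$ denote a matching/non-matching image for recipe $r$. In this reading, the three-stage hard triplet sampling scheme governs how the negatives $r^{-}$ and $v^{-}$ are selected.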