Cross-modal recipe retrieval has attracted widespread attention in recent years, driven by various food-related applications and increasing concern about health. The task is addressed by combining multi-modal data (e.g., images and texts) and is of broad significance for bridging vision and language. Early work focuses on learning joint representations by projecting food images and recipe texts (e.g., ingredients and instructions) into the same embedding space and proposing different cross-modal fusion structures. Recently, most methods adopt a pre-training and fine-tuning strategy to capture the alignment between modalities. While these methods offer appreciable retrieval performance, they still suffer from three limitations: 1) as the complexity of the pre-trained model increases, the data requirements and the computational cost of the fine-tuning stage also rise; 2) the downstream fine-tuning tasks designed for cross-modal recipe retrieval have a gap with the pre-trained model; and 3) the underlying differences between data within the same modality are neglected, and the trilinear interaction among the three inputs is only implicitly captured. To this end, we propose a novel fusion framework named Trilinear FUsion Network (TFUN), which exploits high-level associations among the three inputs simultaneously and explicitly learns an accurate cross-modal similarity function via a bi-directional triplet loss; the framework is generic for the recipe retrieval task. To reduce model complexity, we introduce tensor decomposition to ensure computational efficiency and accessibility, and we develop a three-stage hard triplet sampling scheme to ensure fast convergence. We also propose a Prompt Based Learning Framework (PBLF), which, for the first time, adapts the transferable vision-language model CLIP (Contrastive Language-Image Pre-training) to the recipe retrieval task and designs an appropriate prompt to train the model efficiently, bridging the gap between the pre-trained model and the downstream task and transferring the knowledge of CLIP to the specific recipe retrieval task. Extensive experiments on the large-scale cross-modal recipe dataset Recipe1M demonstrate the superiority of our proposed models compared with state-of-the-art approaches.
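For concreteness, a minimal sketch of a standard bi-directional triplet objective of the kind referred to above, stated under generic assumptions; the symbols (image embedding $v$, recipe embedding $r$, similarity $s(\cdot,\cdot)$, margin $\alpha$) are illustrative and not taken from the paper's own notation:
\[
\mathcal{L}_{\mathrm{tri}} = \sum \Big[\, \big[\alpha - s(v, r^{+}) + s(v, r^{-})\big]_{+} + \big[\alpha - s(r, v^{+}) + s(r, v^{-})\big]_{+} \Big],
\]
where $[x]_{+} = \max(0, x)$, $r^{+}/r^{-}$ denote a matching/non-matching recipe for image $v$, and $v^{+}/v^{-}$ denote a matching/non-matching image for recipe $r$. In this reading, the three-stage hard triplet sampling scheme governs how the negatives $r^{-}$ and $v^{-}$ are selected.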