With the booming development of cross-modal image-text retrieval, the technique has been applied in a variety of scenarios, such as news illustration, image captioning, e-commerce product search, and visual question answering. In the recipe domain, benefiting from the large-scale datasets published by pioneering scholars, cross-modal retrieval between recipe images and recipe text has attracted considerable attention. Because recipes contain complex instruction steps and elements such as ingredients, the text is much longer than in conventional settings, while the food image cannot display the instruction steps or the ingredients used. This makes cross-modal retrieval between recipe images and text considerably more difficult than traditional word-level and sentence-level cross-modal retrieval tasks. Moreover, existing image-text cross-modal retrieval work on food recipes ignores the importance of instruction steps and elements. Therefore, the research on cross-modal retrieval in this thesis is carried out from the perspective of the importance of instruction steps and elements. Based on these observations, this thesis proposes methods for cross-modal retrieval between recipe images and text grounded in the importance of the instruction text. The main contributions are as follows:

(1) This thesis surveys related work on cross-modal retrieval between recipe images and text and analyzes the characteristics of the baseline models. Experimental results on the Recipe1M dataset indicate that the Transformer outperforms the other models as a modality-matching based retrieval method, which provides the technical basis for the subsequent research in this thesis.

(2) Considering that the instruction steps in a recipe have different importance, a modality-matching based cross-modal retrieval method called MMSR is proposed. The method uses a multi-head self-attention mechanism to concentrate on the important steps in a recipe and thereby obtain the feature representation of the instruction text; on the image side, the same mechanism is used to focus on the visual regions related to the instruction steps. The method then maps the recipe text and the food image into multi-layer subspaces to facilitate modality alignment in the common semantic space, and finally computes the matching degree between the representations of the recipe text and the food image (a minimal sketch of this pipeline is given below). Experimental results indicate that the method outperforms several baseline models.

(3) Considering that the instruction elements in a recipe also have different importance, this thesis proposes MMSR+, a cross-modal retrieval method based on key instruction elements that builds on MMSR. During text representation learning, the method first extracts instruction elements from the instruction text with an element extraction module. The outputs of the element representation learning module and the element importance-calculation module are then weighted to generate the feature representation of the key instruction elements. Next, the key-element feature representation and the sentence-level text feature representation are refined by a self-attention mechanism (see the second sketch below). Finally, both the text and image features are mapped into the common semantic space to facilitate modality alignment. Experimental results indicate that this method further improves retrieval quality.
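To make the MMSR pipeline in (2) concrete, the following is a minimal PyTorch sketch, not the thesis implementation: it assumes per-step text embeddings and per-region image features have already been extracted, simplifies the multi-layer subspace mapping to a single linear projection per modality, and all names and dimensions (MMSRSketch, dim=512, 8 heads) are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MMSRSketch(nn.Module):
    """Sketch of the MMSR idea: self-attention over instruction steps and
    image regions, projection into a shared space, cosine matching.
    Dimensions and module names are illustrative assumptions."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Self-attention that weights the important instruction steps.
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Self-attention that weights image regions tied to those steps.
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Projections into the common semantic space (the multi-layer
        # subspace mapping is simplified to one layer here).
        self.text_proj = nn.Linear(dim, dim)
        self.img_proj = nn.Linear(dim, dim)

    def forward(self, step_emb, region_emb):
        # step_emb:   (batch, n_steps, dim)   one vector per instruction step
        # region_emb: (batch, n_regions, dim) one vector per image region
        t, _ = self.text_attn(step_emb, step_emb, step_emb)
        v, _ = self.img_attn(region_emb, region_emb, region_emb)
        # Pool the attended sequences into single recipe / image vectors.
        t = F.normalize(self.text_proj(t.mean(dim=1)), dim=-1)
        v = F.normalize(self.img_proj(v.mean(dim=1)), dim=-1)
        # Matching degree: cosine similarity between the two representations.
        return (t * v).sum(dim=-1)

# Usage: score 4 recipes (10 steps each) against their images (49 regions).
model = MMSRSketch()
scores = model(torch.randn(4, 10, 512), torch.randn(4, 49, 512))
print(scores.shape)  # torch.Size([4])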
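Similarly, a minimal sketch of the MMSR+ text branch in (3), under the assumption that instruction elements (e.g. ingredient mentions) have already been extracted and embedded upstream; the importance-calculation module is approximated here by a single learned scoring layer, and all names are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyElementText(nn.Module):
    """Sketch of the MMSR+ text branch: element representations are weighted
    by learned importance scores, then refined jointly with the sentence-level
    representation by self-attention. Names and sizes are illustrative."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Element importance-calculation module: one scalar per element.
        self.importance = nn.Linear(dim, 1)
        # Self-attention that jointly refines key elements and sentences.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # into the common semantic space

    def forward(self, elem_emb, sent_emb):
        # elem_emb: (batch, n_elems, dim) extracted instruction elements
        # sent_emb: (batch, n_sents, dim) sentence-level step representations
        # Weight element representations by normalized importance scores.
        w = torch.softmax(self.importance(elem_emb), dim=1)
        key_elems = w * elem_emb
        # Refine key elements and sentence features jointly.
        seq = torch.cat([key_elems, sent_emb], dim=1)
        refined, _ = self.attn(seq, seq, seq)
        # Pool and project into the common semantic space.
        return F.normalize(self.proj(refined.mean(dim=1)), dim=-1)

# Usage: 4 recipes with 20 extracted elements and 10 sentence vectors each.
text_feat = KeyElementText()(torch.randn(4, 20, 512), torch.randn(4, 10, 512))
print(text_feat.shape)  # torch.Size([4, 512])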