Visual grounding is a fundamental cross-modal task that spans natural language processing and computer vision, and it is one of the current research hotspots in cross-media intelligence. In early work, the categories a general visual grounding model could recognize were limited to the object categories seen in the training set, which made such models difficult to generalize and apply in practical scenes. To address this problem, researchers proposed the zero-shot grounding task, which requires a model to locate objects in an image whose categories never appear in the training set, given a natural-language query describing them; this is considerably more challenging. To avoid the category restriction imposed by a proposal generation module, zero-shot grounding models discard that module and generally consist of three main steps: cross-modal feature extraction, cross-modal feature alignment and fusion, and object localization. This paper studies these three steps; its main contributions are as follows:

1. To address the insufficient visual-linguistic feature extraction and ineffective cross-modal information fusion of existing zero-shot grounding methods, a zero-shot grounding model based on multi-granularity cross-modality feature learning (MGCMFL) is proposed. The model comprises a multi-granularity language feature extraction module, a weighted bidirectional fusion visual feature extraction module, and a multi-level cross-modality feature fusion module. The multi-granularity language feature extraction module extracts phrase-level and word-level language representations to enrich the language features, helping the model infer and localize unseen objects through the semantics of important modifiers shared by the training and test sets. The weighted bidirectional fusion visual feature extraction module extracts multi-scale visual features and fuses them bidirectionally, yielding visual features with richer semantics. The multi-level cross-modality feature fusion module first fuses the phrase-level and word-level language features with the visual features, and then fuses the result with position features to obtain position-aware fused features, which provide more effective input to the subsequent object localization module and thereby improve identification and grounding accuracy. Experiments on four public zero-shot grounding datasets (Flickr-Split-0, Flickr-Split-1, VG-Split-2, and VG-Split-3) show that MGCMFL improves accuracy by 1.28%-2.70% over the best baseline model, demonstrating its effectiveness.
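To make the multi-level fusion step concrete, the following is a minimal PyTorch sketch of how word-level and phrase-level language features might be fused with visual features via cross-attention and then re-fused with position features. It is not the thesis's actual implementation: the module name, all dimensions, the use of multi-head cross-attention, and the 4-dimensional position encoding are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    """Sketch: fuse word- and phrase-level language features with visual
    features, then re-fuse the result with position features."""

    def __init__(self, d=256, heads=8):
        super().__init__()
        self.word_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.phrase_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.pos_proj = nn.Linear(4, d)   # assumed: normalized grid/anchor coordinates
        self.out = nn.Linear(3 * d, d)

    def forward(self, vis, words, phrases, pos):
        # vis: (B, HW, d) flattened visual features; words: (B, Lw, d)
        # word-level features; phrases: (B, Lp, d) phrase-level features;
        # pos: (B, HW, 4) position encodings.
        v_word, _ = self.word_attn(vis, words, words)        # word-level fusion
        v_phrase, _ = self.phrase_attn(vis, phrases, phrases)  # phrase-level fusion
        fused = torch.cat([vis, v_word, v_phrase], dim=-1)
        return self.out(fused) + self.pos_proj(pos)          # re-fuse with positions
```

Cross-attending at the two granularities separately, then concatenating, keeps the word-level and phrase-level contributions distinguishable before the position-aware re-fusion.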
2. To address the difficulty existing zero-shot grounding models have in accurately aligning linguistic and visual modality information, and the interference of category noise in the classification loss computation, a language-guided vision zero-shot grounding model with a noise-reduction loss (NRLLGV) is proposed, consisting of a language-guided vision module and a noise-reduction loss module. The language-guided vision module uses an attention mechanism to extract the visual-feature weights associated with the language features, actively guiding the visual feature extractor to strengthen the visual regions related to the query and thereby aligning the language and visual modalities. The noise-reduction loss module assigns noise samples a larger classification loss through a specially designed weighted focal loss, improving the performance of the zero-shot grounding model. Experiments on the same four public zero-shot grounding datasets (Flickr-Split-0, Flickr-Split-1, VG-Split-2, and VG-Split-3) show that NRLLGV improves accuracy by 0.60%-1.15% over the MGCMFL variant model, demonstrating its effectiveness.
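As an illustration of the idea behind the noise-reduction loss, the following is a minimal PyTorch sketch of a weighted focal loss that enlarges the loss of noise-flagged samples. The function name, the hyperparameter values, and the assumption that noise samples are identified by a boolean mask are all illustrative; the thesis's actual weighting scheme may differ.

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, targets, noise_mask, gamma=2.0, noise_weight=2.0):
    """Focal loss with a per-sample weight that enlarges the classification
    loss of noise-flagged samples (gamma and noise_weight are assumed values).

    logits:     (N, C) raw classification scores
    targets:    (N,)   ground-truth class indices
    noise_mask: (N,)   bool tensor, True for samples affected by category noise
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log p_t per sample
    p_t = torch.exp(-ce)                                     # probability of the true class
    focal = (1.0 - p_t) ** gamma * ce                        # standard focal modulation
    weights = torch.where(noise_mask,
                          torch.full_like(ce, noise_weight),
                          torch.ones_like(ce))               # larger weight on noise samples
    return (weights * focal).mean()
```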