| Object detection from natural language descriptions is an important research direction in vision-language tasks. Its aim is to identify the target region in an image that corresponds to an input language description. The task involves both visual object detection and natural language understanding, and it plays a crucial role in understanding today's massive multimodal data and fully mining the useful information it contains. Current mainstream algorithms for this task extract visual and textual features independently, which yields fixed visual features that cannot adapt to the language description. In reality, however, the same target can be described in different ways. The key problem is therefore how to use the language description to guide visual feature extraction and obtain visual features that are consistent with the language features. To this end, this paper constructs a dynamic attention module based on language features to guide the extraction of visual features, thereby ensuring consistency between visual features and language descriptions and enhancing the discriminability of target-region features. In addition, considering the importance of multi-scale features for detection, the dynamic attention module is also used to carry out text-guided interaction between multi-scale feature levels, selectively collecting the multimodal features that correspond to different scales in the image. Feature visualization results show that the proposed dynamic attention module extracts adaptive visual features, and accuracy improves on multiple standard datasets. The detection performance of current algorithms is also greatly limited when the input is a long language description. A long sentence contains many words, so the effective information in it must be extracted accurately, and the complex relationships among the multiple words or objects it involves require accurate modeling of context information. Therefore, this paper proposes a multimodal feature fusion method based on graph-convolutional context modeling, which establishes inter-modal and intra-modal context relationships by constructing a graph structure, so as to fully perceive the connections between targets and deeply understand the complex semantics of the image and the language description. The multimodal context information guides the multimodal feature fusion process, and finally multi-level dilated (atrous) convolution is used to enhance the multimodal features, perceiving semantic information over a wider range and yielding more discriminative multimodal features. The proposed algorithm achieves significant performance improvements on multiple standard datasets. |
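
The idea of letting a language description reweight visual features can be illustrated with a minimal sketch. This is not the paper's dynamic attention module, whose details the abstract does not give. It assumes scaled dot-product scoring followed by a softmax, and the names `language_guided_attention`, `regions`, and `sentence` are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def language_guided_attention(visual_feats, lang_feat):
    """Reweight region features by their similarity to a sentence feature.

    visual_feats: list of region vectors (hypothetical visual features).
    lang_feat: sentence vector of the same dimension (hypothetical).
    Assumption: scaled dot-product scoring; the abstract does not specify
    how the attention weights are actually computed.
    """
    scale = math.sqrt(len(lang_feat))
    scores = [
        sum(v * l for v, l in zip(region, lang_feat)) / scale
        for region in visual_feats
    ]
    weights = softmax(scores)
    # Each region feature is scaled by its language-conditioned weight.
    attended = [[w * v for v in region] for w, region in zip(weights, visual_feats)]
    return weights, attended


# Toy data: three image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
sentence = [1.0, 0.0]
weights, attended = language_guided_attention(regions, sentence)
```

With the same `regions`, changing `sentence` changes which region receives the largest weight. This is the sense in which the visual features adapt to the description rather than staying fixed.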