| With the development of computer science and technology, and especially with the rapid growth in the volume of image and text data, multimodal retrieval and analysis has become one of the most active research topics and a central concern of image-text matching. In recent years, research on image-text matching has focused on how to narrow the semantic gap between the visual and textual modalities and how to accurately measure the semantic similarity between images and text. Although many related works exist, they still have limitations: 1) a considerable portion of the work overlooks, to some extent, the multi-view description of visual information; 2) semantic complexity interferes with retrieval and matching. As a result, it is difficult to match a given image with multiple text representations, that is, to align the two in the feature space. To address these issues, this paper proposes a hierarchical multi-view image-text matching method (CAMERA++). First, the method uses an adaptive gated self-attention mechanism to capture local visual region features and textual features, adaptively mine context information, and control the internal information flow at a more fine-grained feature level. Second, the method hierarchically summarizes the context-enhanced local visual region features from multiple views, aggregating region-level features into image-level features. Third, the method applies diversity regularization hierarchically to reduce the information redundancy among the hierarchical views. In addition, the method takes the distribution of modal information in the feature space into account: it constrains the training process by aligning the actual distribution of modal features with the theoretical distribution of the feature space, which enhances the robustness of the model. Finally, the method unfreezes some parameters of the pre-trained BERT and fine-tunes them to strengthen the expressive ability of the text-branch features. The proposed method is evaluated in extensive experiments on two publicly available large-scale datasets to verify its validity and effectiveness. |
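
The abstract does not give the exact form of the diversity regularization, so the following is only a minimal sketch under stated assumptions: each view is represented by one embedding vector, and redundancy is measured as the mean squared cosine similarity between distinct pairs of views. The function names `l2_normalize` and `diversity_penalty` are hypothetical and are not taken from the paper. In a hierarchical setting, one could plausibly apply the same penalty separately at each level and sum the results.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def diversity_penalty(views):
    """Mean squared cosine similarity over all distinct pairs of views.

    Assumed form, not the paper's definition: the value is 0 when all
    views are mutually orthogonal and approaches 1 when they coincide,
    so minimizing it pushes the views toward non-redundant directions.
    """
    unit = [l2_normalize(v) for v in views]
    k = len(unit)
    if k < 2:
        return 0.0  # a single view has nothing to be redundant with
    total = 0.0
    pairs = 0
    for i in range(k):
        for j in range(i + 1, k):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += cos * cos
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    orthogonal = [[1.0, 0.0], [0.0, 1.0]]
    redundant = [[3.0, 4.0], [3.0, 4.0]]
    print(diversity_penalty(orthogonal))  # no overlap between views
    print(diversity_penalty(redundant))   # views carry the same information
```

Such a term would typically be added to the matching loss with a small weight, so that the views stay distinct without dominating the retrieval objective.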