Person re-identification aims to retrieve the same pedestrian across different cameras. Cross-modality person re-identification is an active research topic at the intersection of vision and language, with practical applications in video surveillance, public safety, and intelligent transportation. To make full use of the information in both the image and text modalities, and to mitigate the impact of viewpoint variation, occlusion, and surrogate positive samples in complex scenarios, this dissertation proposes an end-to-end approach from the perspective of robust feature extraction, focusing on three aspects: multi-scale information aggregation, local bias and salient information enhancement, and inter-class information enhancement. Experiments were conducted on three widely used datasets. The main research work is as follows:

(1) To address the common problem of viewpoint variation, a cross-modality person re-identification method based on multi-scale information aggregation is proposed. For the visual modality, the method applies coordinate attention to embed position information into the channel dimension, and uses a multi-scale feature fusion module to fuse visual features from different receptive fields. To better capture local information in text, the method introduces self-attention and convolutional mixing modules that combine contextual information with local receptive field text features, thereby alleviating the differing representations of the same object across the image and text modalities under different viewpoints. Comparative experiments, ablation studies, and visualization results demonstrate that the proposed method mitigates the effect of viewpoint variation and improves the accuracy of cross-modality person re-identification.

(2) To address the problem of incomplete information caused by occlusion, a cross-modality person re-identification method based on local bias and salient information enhancement is proposed. Local bias is produced by randomly erasing image and text data. In the salient information extraction stage, a residual channel-spatial attention module recalculates the channel and spatial weights of image and text features in order to suppress occlusion-related interference. Comparative experiments, ablation studies, and visualization results demonstrate that the proposed method effectively reduces the impact of occlusion and improves the accuracy and reliability of cross-modality person re-identification.

(3) To address the problem of surrogate positive samples caused by the disparity in semantic richness between pedestrian images and text descriptions, a method based on inter-class information enhancement is proposed. This method introduces external attention into the image and text branches, using external memory units to implicitly learn the information differences between samples. Additionally, residual attention is introduced at the output of the image branch to capture fine-grained feature information. Comparative experiments, ablation studies, and visualization results show that this method learns inter-class information differences, enlarges inter-class distances, and effectively distinguishes surrogate positive samples by capturing fine-grained features in pedestrian images.
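To illustrate the external attention mechanism described in (3), the following is a minimal PyTorch sketch following the standard external attention formulation (two small learnable memory units with double normalization). The module name, dimensions, and memory size are illustrative assumptions, not the dissertation's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalAttention(nn.Module):
    """External attention: attends over two small learnable external
    memory units (implemented as linear layers) that are shared across
    all samples, instead of computing self-attention within one sample."""
    def __init__(self, dim: int, mem_size: int = 64):
        super().__init__()
        self.mk = nn.Linear(dim, mem_size, bias=False)  # external key memory
        self.mv = nn.Linear(mem_size, dim, bias=False)  # external value memory

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), e.g. image patch or text token features
        attn = F.softmax(self.mk(x), dim=1)                   # normalize over tokens
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)  # double normalization
        return self.mv(attn)                                  # (batch, tokens, dim)

# Hypothetical usage: a batch of 8 sequences of 197 token features of width 512
feats = torch.randn(8, 197, 512)
out = ExternalAttention(dim=512)(feats)
print(out.shape)  # torch.Size([8, 197, 512])
```

Because the memory units are learned over the whole dataset rather than computed per sample, they can implicitly capture relations between different samples, which is what makes this mechanism suitable for learning inter-class information differences.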
In summary, research has been conducted on the problems of viewpoint variation, occlusion, and surrogate positive samples in complex scenarios. Experiments and visualization results on three widely used cross-modality person re-identification datasets demonstrate that the proposed methods help the model extract robust features in complex scenarios and effectively complete the cross-modality person re-identification task in such settings.