Crowd counting, one of the most active research directions in computer vision, aims to efficiently and accurately estimate the total number of people in an image and to generate density maps that reflect the spatial distribution of the crowd. It has important applications in intelligent security, intelligent transportation, and other fields. In real-world scenarios, however, crowd counting models face challenges such as scale variation and background occlusion, which leave room for further improvement in accuracy and robustness. In recent years, attention mechanisms have been introduced into visual tasks so that models can selectively focus on the most informative regions of an image, providing a new way to address these problems. This paper therefore conducts an in-depth study of crowd counting with attention mechanisms; the main content and innovations are summarized as follows:

(1) A multi-scale attention fusion crowd counting model is proposed to fully exploit the detailed features extracted by low-level convolutions. The model consists of two subnetworks: scale attention extraction and multi-level fusion. The former leverages an encoder-decoder structure to extract detailed features and semantic information at different levels, forming scale attention; the latter fuses the scale attention with the corresponding convolutional feature maps to suppress the effects of scale variation and background occlusion, thereby generating high-quality density maps. In particular, the resolution of the feature maps drops significantly after multiple pooling operations, and the resulting loss of spatial information is detrimental to density map generation. To address this, transposed convolutions are introduced to reconstruct the spatial information of the feature maps so that they meet the requirements of density map generation. Extensive experiments on multiple datasets demonstrate that the proposed model effectively handles scale variation and background occlusion and achieves excellent counting performance in both sparse and dense crowd scenes.

(2) A multi-scale guided self-attention crowd counting model is proposed to model the human visual mechanism of recognizing scale variation and background occlusion through global and local contrast. First, crowd images are rescaled to different resolutions by convolutional structures and then fed into a Transformer, which allows the model to extract local features while taking global contextual information into account, achieving interaction between global and local information. Second, the feature maps produced by the convolutional rescaling contain a large number of pixels, and the resulting parameter count harms training efficiency; the dimensionality of the feature maps is therefore reduced to match the number of pixels in the input image. Finally, a fixed-scale Transformer limits the model's ability to judge scale variation in crowds, so this paper instantiates a multi-scale Transformer by applying the idea of rescaling the original image. Extensive experimental results demonstrate that taking global information into account significantly improves counting accuracy, and compared with other crowd counting models, the proposed model shows clear advantages in extremely dense crowd scenes.
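
To make the design in contribution (1) concrete, the following is a minimal PyTorch sketch, not the thesis code: it assumes a small convolutional backbone, an encoder-decoder branch that produces a per-pixel scale attention map, fusion by element-wise weighting, and transposed convolutions that restore the spatial resolution lost to pooling before regressing the density map. All layer widths and depths are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ScaleAttentionFusion(nn.Module):
    """Illustrative sketch: scale attention extraction + multi-level fusion."""

    def __init__(self, in_channels=3, feat_channels=64):
        super().__init__()
        # Stand-in for the low-level convolutional backbone (two pooling stages).
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        # Encoder-decoder branch that yields a per-pixel attention map in [0, 1].
        self.attention = nn.Sequential(
            nn.Conv2d(feat_channels, feat_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat_channels, feat_channels, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, 1, 1), nn.Sigmoid(),
        )
        # Transposed convolutions reconstruct the spatial resolution lost to pooling.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_channels, feat_channels, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat_channels, feat_channels, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, 1, 1),  # single-channel density map
        )

    def forward(self, x):
        feats = self.features(x)       # (B, C, H/4, W/4)
        attn = self.attention(feats)   # (B, 1, H/4, W/4)
        fused = feats * attn           # attention-weighted feature fusion
        return self.decoder(fused)     # (B, 1, H, W) density map


if __name__ == "__main__":
    img = torch.randn(1, 3, 256, 256)
    density = ScaleAttentionFusion()(img)
    print(density.shape, density.sum().item())  # predicted count = sum of the density map
```

The predicted count is read off as the integral (sum) of the density map, which is the standard convention for density-map-based counting models.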
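Similarly, the following hedged sketch illustrates the idea behind contribution (2) under assumed hyperparameters: convolutions rescale the image into token sequences at several resolutions, a shared Transformer encoder lets every token attend to the whole multi-scale sequence so local features are interpreted in a global context, and a simple regression head produces the count. It is a minimal stand-in, not the multi-scale guided self-attention model itself.

```python
import torch
import torch.nn as nn


class MultiScaleTransformerCounter(nn.Module):
    """Illustrative sketch: multi-scale convolutional embeddings + global self-attention."""

    def __init__(self, embed_dim=128, num_heads=4, depth=2, strides=(8, 16, 32)):
        super().__init__()
        # One convolutional "patch embedding" per scale; a larger stride gives a coarser resolution.
        self.embeds = nn.ModuleList(
            nn.Conv2d(3, embed_dim, kernel_size=s, stride=s) for s in strides
        )
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, 1)  # per-token count contribution

    def forward(self, x):
        tokens = []
        for embed in self.embeds:
            f = embed(x)                                  # (B, C, H/s, W/s)
            tokens.append(f.flatten(2).transpose(1, 2))   # (B, H/s * W/s, C)
        tokens = torch.cat(tokens, dim=1)                 # all scales in one sequence
        tokens = self.encoder(tokens)                     # global self-attention across scales
        return self.head(tokens).sum(dim=(1, 2))          # total count per image


if __name__ == "__main__":
    img = torch.randn(2, 3, 256, 256)
    print(MultiScaleTransformerCounter()(img).shape)  # torch.Size([2])
```

Concatenating the token sequences from all scales before the encoder is one simple way to realize the global-local interaction described above; the actual model may fuse scales differently.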