| Large-scale crowd gatherings pose heightened risks to public safety. Accurately counting the people in a dense scene and visualizing the spatial distribution of the crowd, so that the authorities receive real-time early warnings, can effectively prevent abnormal events, which is crucial for security monitoring, traffic control, and urban safety. In recent years, deep learning techniques have significantly improved the performance of crowd counting models, but many challenges remain, including diverse crowd distributions and backgrounds across images, drastic changes in scale, and severe occlusion. To address these issues, this paper builds on previous research in crowd counting to further investigate shallow feature extraction, feature fusion, and the generation of fine-grained density maps. It combines deep convolutional networks, spatial pyramid pooling, and attention mechanisms to build deep neural network models that understand and accurately count densely packed large-scale crowd scenes and generate fine-grained crowd density maps. The main work and innovations of this paper can be summarized as follows: (1) A crowd density estimation algorithm based on a spatial contextual feature fusion network. The goal is to use the correct context at each location of a crowd image and to take crowd attention information into account, so that the density map is predicted more accurately at the pixel level. The algorithm first uses the first ten convolutional layers of the VGG-16 network, excluding the fully connected layers, to extract low-level features from the input image; then computes scale-aware features using rich convolutions at different scales and adaptively encodes the scale of the contextual information required to accurately estimate the density map; afterward, the fused feature maps are calibrated and re-fused by the channel-spatial attention-aware module, which suppresses some background details to focus the model’s attention on the head
region of each pedestrian; and finally, the crowd density is estimated by a dilated (atrous) convolutional network. Comparative experiments on several publicly available crowd counting datasets demonstrate that the algorithm predicts crowd density accurately while improving the effectiveness of feature fusion. (2) An improved crowd density estimation algorithm based on a multi-scale contextual feature fusion network, obtained by optimizing the initial implementation of the spatial contextual feature fusion network-based algorithm (denoted the initial model). The aim is to enhance the model's adaptability to drastic scale changes in crowd images while resolving the conflict between performance and complexity. Instead of learning a predictive weight map to set the relative influence of each scale-aware feature at each spatial location, the algorithm groups shallow features into four parallel blocks of different sizes to extract contextual feature information at different scales and injects channel and spatial dependencies into the feature map, enabling the model to focus on useful features while suppressing irrelevant context. Extensive experiments on several large and challenging crowd counting datasets show that the proposed algorithm outperforms many recent state-of-the-art methods. Furthermore, the algorithm not only retains the robustness, accuracy, and generalization of the initial model but also has lower time and space complexity. |
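
To make the link between a density map and a crowd count concrete, the following is a minimal sketch in plain Python and is not the network described above. It assumes the common convention of placing one normalized Gaussian kernel at each annotated head position, so that the map sums to the number of people. The function names (`density_map`, `count_from_density`, `mae`), the fixed `sigma`, and the toy annotations are all hypothetical choices made for illustration.

```python
import math


def density_map(heads, height, width, sigma=1.5):
    """Build a density map from (row, col) head annotations.

    Assumption: a fixed-width Gaussian per head; the abstract does not
    specify how ground-truth maps are generated.
    """
    density = [[0.0] * width for _ in range(height)]
    for hy, hx in heads:
        # Unnormalized Gaussian weights centred on the head position.
        weights = [
            [math.exp(-((y - hy) ** 2 + (x - hx) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, weights))
        # Normalize so each person contributes exactly 1 to the map's sum,
        # even when the kernel is clipped by the image border.
        for y in range(height):
            for x in range(width):
                density[y][x] += weights[y][x] / total
    return density


def count_from_density(density):
    """The estimated crowd count is the sum over all pixels of the map."""
    return sum(map(sum, density))


def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and ground-truth counts."""
    errors = [abs(p - t) for p, t in zip(pred_counts, true_counts)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    heads = [(2, 3), (10, 12), (10, 13)]  # toy annotations, not real data
    dmap = density_map(heads, height=16, width=16)
    print(round(count_from_density(dmap), 6))
```

In a learned model, the network predicts such a map directly from the image, and the count is obtained by the same pixel-wise sum; `mae` illustrates one metric commonly used to compare predicted and annotated counts.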