| With the continuous growth of the world’s population, crowds gathering in public places are becoming increasingly common. In these crowded scenarios, counting and monitoring the flow of people is crucial for preventing safety accidents. In addition, crowd gathering accelerates the spread of certain diseases, and monitoring and analyzing high-density crowds can help keep the distance between people within a reasonable range, which can effectively prevent the spread of disease. The crowd counting task aims to estimate the number and density of people in different scenes. It is an important means of analyzing and supervising dense crowds and is of great significance for public safety, video surveillance, and traffic control. The rapid development of deep learning has advanced crowd counting methods, and many methods have made good progress in counting accuracy and speed. However, scale variation and background interference in complex scenes degrade counting performance in practical applications, and accurate crowd counting still faces serious challenges. Therefore, this thesis studies these problems in the crowd counting task and proposes solutions. The main research work is as follows: (1) To address scale variation and background interference in images, this thesis proposes a Hierarchical Feature Aggregation Network with Semantic Attention. First, a Semantic Attention module is designed at the end of the backbone network, which increases the attention paid to the crowd region and reduces the interference of background noise by applying an attention mechanism to the shallow feature maps extracted by the network. Second, the network uses convolutional kernels of different sizes and Global Average Pooling operations to extract multi-scale features and global contextual features, respectively. Then, the extracted features are progressively fused in the network
through Feature Aggregation modules, which make full use of the complementary properties of low-level and high-level features to integrate features at different levels effectively. Finally, the counting performance of the method is evaluated on several challenging crowd counting datasets, and the experimental results show that the method can effectively aggregate multi-level features to achieve accurate counting in crowded scenarios. (2) To address the problem that convolutional neural networks have limited receptive fields and cannot effectively model global context, this thesis builds on the Hierarchical Feature Aggregation Network and proposes a Two-branch Feature Fusion Network based on CNN and Transformer. It introduces a Transformer encoder at the front end of the model to capture long-range contextual dependencies and strengthen its ability to model global context, thus improving counting performance. First, the network employs two parallel branches, VGG-16 and Transformer, as the front-end network to extract semantic features with global contextual information while also extracting local detail features. Then, the Feature Dual-fusion module uses multiplication operations to fuse local and global features hierarchically. Further, to reduce the effect of background noise, the Density Guided Regression module generates attention masks that enhance the differences between regions of different density in the feature maps, and it generates the density map, which highlights the target crowd region against the cluttered background. Finally, the counting performance of the method is evaluated on several datasets and compared with other methods. The experimental results demonstrate that the method can effectively model the global context and improve counting accuracy. |
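
The multiplicative fusion and mask-guided density regression described above can be illustrated with a toy sketch. This is not the thesis implementation: the function names, the single-channel 2x2 maps, and the sigmoid form of the mask are all assumptions made for illustration, and the actual modules presumably operate on multi-channel feature tensors with learned layers. The only convention relied on is the usual one in density-based crowd counting, where the estimated count is the sum of the density map.

```python
import math


def sigmoid(x):
    """Logistic function, used here to turn mask logits into weights in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_multiplicative(local_feat, global_feat):
    """Element-wise product of two same-sized 2D maps (hypothetical stand-in
    for fusing local CNN features with global Transformer features)."""
    return [[l * g for l, g in zip(l_row, g_row)]
            for l_row, g_row in zip(local_feat, global_feat)]


def apply_attention_mask(feat, mask_logits):
    """Gate each location by a sigmoid mask so background responses are suppressed."""
    return [[f * sigmoid(a) for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(feat, mask_logits)]


def count_from_density(density):
    """The estimated crowd count is the sum over the density map."""
    return sum(sum(row) for row in density)


# Toy inputs (assumed values, not data from the thesis).
local_feat = [[0.5, 0.0], [1.0, 2.0]]
global_feat = [[2.0, 1.0], [1.0, 0.5]]
# Large positive logits mark crowd regions; the negative one marks background.
mask_logits = [[10.0, -10.0], [10.0, 10.0]]

fused = fuse_multiplicative(local_feat, global_feat)
density = apply_attention_mask(fused, mask_logits)
estimated_count = count_from_density(density)
print(round(estimated_count, 3))
```

In this toy case, a location contributes to the fused map only when both branches respond there, and the mask then scales each location by how strongly it is judged to belong to the crowd.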