| With the exponential growth of the world population and the outbreak of the COVID-19 pandemic, public safety and intelligent surveillance have gained new opportunities. As one of the key problems in this area, crowd counting has become an active research topic in computer vision. The aim of crowd counting is to automatically estimate the number of people in a given image or video. Owing to the powerful feature learning ability of convolutional neural networks (CNNs), pioneering CNN-based regressors have outperformed traditional methods in congested scenes. However, crowd counting remains a challenging task because of the scale variations in people’s heads, the non-uniform distribution of crowds, and the difficulty of labeling. This thesis is devoted to these problems and challenges. Its main research content is fourfold: (1) To address the scale variations in people’s heads, a scale-aware convolutional neural network, called MMNet, is proposed for crowd counting. MMNet uses filters of different sizes to extract multi-scale features in the encoding stage and fuses the multi-scale features generated by different stages in the decoding stage. Because the crowd density distribution carries critical information about head size, multi-level density-based spatial information is employed to supervise the fusion of multi-scale features in the network. Experimental results on four benchmark datasets demonstrate the effectiveness of MMNet compared with state-of-the-art methods. (2) For the non-uniform density distribution problem, a novel convolutional neural network framework with mixed ground truth, called the top-k relation-based network (TKRNet), is proposed for crowd counting. First, an adaptive top-k relation module (ATRM) is proposed to enhance feature representations by leveraging the top-k dependencies among the pixels with an 
adaptive filtering mechanism. Specifically, the similarity among pixels is computed in order to select the top-k relations for each position. Then, a weight normalization operation with an adaptive filtering mechanism is applied to eliminate the influence of low-correlation positions among the top-k relations. Finally, a weight attention mechanism is introduced so that the ATRM pays more attention to the positions with high weights in the top-k relations. In addition, the estimated density maps generated in a coarse-to-fine manner are treated as coarse locations of crowds, which assists TKRNet in regressing the scattered point-annotated ground truth. Extensive experimental results on several public datasets demonstrate the effectiveness of TKRNet compared with state-of-the-art methods. (3) Considering the high cost of location labels in crowd counting and the fact that point-level labels are not used in the evaluation metrics, a transformer-based crowd counting method with only count-level supervision is proposed. Specifically, an adaptive Twins-based encoder is used to extract multi-level features, and a U-Net-style decoder regresses the crowd count in a coarse-to-fine manner. First, an adaptive consistency attention is incorporated into the locally-grouped self-attention in Twins, improving the feature extraction ability of the encoder by accounting for the non-uniform distribution of crowds. Second, a multi-level weakly-supervised loss is used to assist gradient backpropagation and to reduce overfitting. Moreover, in the decoding process, the intermediate features supervised by count-level labels are fed into the network to fuse multi-scale features. Extensive experimental results on four public datasets demonstrate that the proposed method achieves superior performance in comparison to state-of-the-art weakly-supervised methods and obtains competitive counting performance compared to 
fully-supervised methods. (4) Considering the characteristics of video-based crowd counting, a new cross locality relation network (CLRNet) is proposed for crowd counting in videos. Specifically, a cross locality relation module (CLRM) is proposed to enhance feature representations by modeling local dependencies between pixels in adjacent frames with an adapted local self-attention mechanism. First, unlike existing methods, which measure the similarity between pixels by the scaled dot product, a new adaptive cosine similarity is developed to measure the relationship between two positions. Second, traditional self-attention modules usually integrate the reconstructed features with the same weights for all positions. However, crowd movements and background changes in a video sequence are non-uniform in real-world applications, so it is inappropriate to treat all positions in the reconstructed features equally. To address this issue, a scene consistency attention map (SCAM) is developed so that the CLRM pays more attention to the positions with strong correlations in adjacent frames. In addition, the CLRM is incorporated into the network in a coarse-to-fine manner to accommodate the scale variations in people’s heads. Experimental results on five public video datasets demonstrate the effectiveness of CLRNet compared with state-of-the-art methods. |
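
The top-k relation idea summarized in contribution (2) can be illustrated with a minimal NumPy sketch. It is not the ATRM as implemented in the thesis: the abstract does not give the exact formulation, so the function name `topk_relation`, the use of cosine similarity, the mean-based filter, and the softmax normalization are all assumptions chosen only to show the three steps (select the top-k relations, filter out weak ones, aggregate with normalized weights).

```python
import numpy as np

def topk_relation(features, k=4):
    """Hypothetical sketch of a top-k relation step.

    features: (N, C) array holding one C-dimensional vector per pixel.
    Returns the aggregated (N, C) features and the (N, k) relation weights.
    """
    # Pairwise cosine similarity between all pixels (assumed similarity measure).
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    sim = unit @ unit.T                                   # (N, N)

    # Keep only the k most similar positions for each pixel.
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]     # (N, k)
    top_sim = np.take_along_axis(sim, idx, axis=1)        # (N, k)

    # Assumed adaptive filter: drop relations weaker than the row-wise mean
    # of the top-k similarities. The strongest relation always survives.
    keep = top_sim >= top_sim.mean(axis=1, keepdims=True)

    # Softmax over the surviving relations; filtered ones get zero weight.
    logits = np.where(keep, top_sim, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Weighted sum of the selected neighbours' features.
    gathered = features[idx]                              # (N, k, C)
    return (weights[..., None] * gathered).sum(axis=1), weights
```

In this sketch every pixel attends to at most k other positions instead of all N, and positions whose similarity falls below the adaptive threshold contribute nothing to the output, which mirrors the filtering role the abstract describes.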