| Human pose estimation is a key technology in computer vision and is widely used in behavior analysis, pose tracking, autonomous driving, healthcare, and other fields. With the continuous development of deep learning, its accuracy has improved greatly. However, existing deep human pose estimation networks have large parameter counts and high computational complexity, which makes them difficult to deploy in practical scenarios. In video-based human pose estimation, occlusion of key body parts, rapid pose changes, and jitter between adjacent frames can interfere with the inference of keypoint locations; fully exploiting the temporal information between frames helps the model better understand context. This paper studies human pose estimation based on a lightweight high-resolution network and on video sequences. The main work includes: (1) A lightweight, high-resolution human pose estimation method based on an attention mechanism is studied. It addresses the large parameter count, the high computational complexity, and the insufficient focus on key regions of the human body in deep convolutional networks. By introducing depth-wise separable convolution and replacing the original standard modules in HRNet with the lightweight MBConv and Fused-MBConv modules, the model's parameter count and computational complexity are reduced. Meanwhile, an attention mechanism captures the global information of the feature maps and provides more accurate features for detecting occluded and small-scale keypoints, which compensates for the feature extraction capacity lost by reducing the parameters. Comparative experiments on the MPII and COCO benchmark datasets, together with visualization experiments, verify that this lightweight high-resolution method effectively reduces the model's parameter count and computational complexity while maintaining keypoint estimation performance. (2) A temporally constrained human pose estimation method based on video sequences is studied. It addresses the loss of pose information caused by occlusion in video data and the insufficient use of temporal dependencies. An LSTM module captures long-range dependencies, extracts information along the temporal dimension, and models the relationships between contexts. To fully exploit the complementary information in adjacent frames, a pose temporal constraint module is constructed that fuses the joint heatmaps of each frame to narrow the search range of each keypoint and stabilize joint regression. Furthermore, an atrous spatial pyramid pooling (ASPP) module is incorporated to improve the model's ability to capture subtle local motion information. Comparative experiments on three video datasets (Penn Action, Sub-JHMDB, and PoseTrack), together with ablation experiments, verify that the temporally constrained video-based method effectively uses the temporal information in videos to improve the accuracy of human pose estimation. |
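
The parameter saving from depth-wise separable convolution mentioned in (1) can be illustrated with a short calculation. The sketch below is only an assumption-laden illustration, not the paper's implementation: the function names and the layer sizes (64 input channels, 64 output channels, 3 x 3 kernel) are hypothetical, bias terms are ignored, and the expansion and attention parts of MBConv and Fused-MBConv are not modeled.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depth-wise k x k conv followed by a 1 x 1 point-wise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Illustrative layer sizes only; they are not taken from the paper.
    c_in, c_out, k = 64, 64, 3
    std = standard_conv_params(c_in, c_out, k)
    sep = depthwise_separable_params(c_in, c_out, k)
    print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.3f}")
```

For these sizes the separable layer needs 4,672 weights against 36,864 for the standard layer, a ratio of 1/c_out + 1/k², or roughly 0.127. This is the kind of reduction that the attention mechanism is then meant to offset in representational terms.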