| 3D human pose estimation locates the 3D coordinates of human joints from images or videos. To address inaccurate predictions caused by occlusion and complex poses, this paper explores the use of human body topology and temporal information to improve 3D human pose estimation. The main work is summarized as follows. First, to address inaccurate 3D pose estimation caused by occlusion and ambiguity, this paper uses prior knowledge of the human pose topology in a network that combines graph convolution with an attention module. The attention module extracts global pose information, while the graph convolution captures the spatial constraints between adjacent joints and strengthens their mutual influence. Finally, a linear layer regresses the pose representation to the 3D pose space to obtain the 3D human pose. Second, to address the temporal jitter of single-frame prediction, this paper uses temporal and spatial information to construct a frame-level progressive aggregation network based on a spatiotemporal Transformer. A spatial encoder models the relationships between human joints within each video frame, and a temporal encoder produces a pose representation that carries temporal information. Strided convolution then aggregates local temporal information and gradually reduces the sequence length, so that the network focuses on predicting the 3D pose of the intermediate frames of the video. Third, to address occlusion and the information loss that occurs when spatial and temporal information are extracted separately, this paper adds more spatial constraints and constructs a spatiotemporal graph attention network. For spatial feature extraction, attention models global spatial information, and the adjacency matrix of the graph convolution is modified to add local spatial constraints based on kinematic connections and symmetry, which highlights the role of local information in estimating the pose of occluded parts. Temporal convolutional networks model the temporal dimension. To reduce the loss of spatiotemporal information, temporal convolution and graph attention modules are interleaved to form the network that predicts the 3D poses. To verify the effectiveness of the proposed methods, quantitative and qualitative experiments are carried out on the public datasets Human3.6M and HumanEva. The experimental results show that, compared with other similar methods, the model constructed in this paper significantly improves prediction accuracy. |
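
The abstract does not state how the adjacency matrix is modified, so the following is only a minimal sketch of one plausible reading: symmetry edges between left and right joints are added on top of the kinematic (bone) edges, and the matrix is row-normalized before neighbor aggregation. The five-joint skeleton, the names `KINEMATIC_EDGES`, `SYMMETRY_EDGES`, `build_adjacency`, and `aggregate`, and the normalization choice are all assumptions made for illustration. Learned weights and the attention branch are omitted.

```python
# Hypothetical 5-joint toy skeleton; joints and edges are illustrative only.
JOINTS = ["pelvis", "l_hip", "r_hip", "l_knee", "r_knee"]
KINEMATIC_EDGES = [(0, 1), (0, 2), (1, 3), (2, 4)]  # bone connections
SYMMETRY_EDGES = [(1, 2), (3, 4)]                   # left/right counterparts


def build_adjacency(num_joints, edge_lists):
    """Return a row-normalized adjacency matrix with self-loops."""
    adj = [[0.0] * num_joints for _ in range(num_joints)]
    for i in range(num_joints):
        adj[i][i] = 1.0  # self-loop keeps each joint's own feature
    for edges in edge_lists:
        for a, b in edges:
            adj[a][b] = 1.0  # undirected edge
            adj[b][a] = 1.0
    for row in adj:
        total = sum(row)
        for j in range(num_joints):
            row[j] /= total  # each row sums to 1
    return adj


def aggregate(adj, features):
    """One neighbor-averaging step: out[i] = sum_j adj[i][j] * features[j]."""
    dim = len(features[0])
    return [
        [sum(adj[i][j] * features[j][d] for j in range(len(features)))
         for d in range(dim)]
        for i in range(len(adj))
    ]


# Kinematic edges only versus kinematic plus symmetry edges.
adj_kinematic = build_adjacency(len(JOINTS), [KINEMATIC_EDGES])
adj_both = build_adjacency(len(JOINTS), [KINEMATIC_EDGES, SYMMETRY_EDGES])
```

With only kinematic edges, the left knee aggregates from itself and the left hip. With symmetry edges added, it also receives information from the right knee, which is the kind of extra local constraint that could help when one side of the body is occluded.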