| 3D human pose estimation is a downstream task of object detection. It aims to detect the position of the human body and the 3D coordinates of each of its keypoints in an input image, and to connect the detected keypoints into a skeleton model that describes the 3D human pose in the image. The task is important for many fields, such as virtual reality, security, and 3D reconstruction. However, mapping a 2D image to 3D joint coordinates is a highly nonlinear problem that suffers from depth ambiguity: a single 2D pose often corresponds to multiple sets of 3D joint coordinates, which makes 3D human pose estimation from monocular images challenging. In this paper, we propose two Transformer-based multi-hypothesis interaction algorithms to address the weak feature extraction of the Transformer and the existence of multiple solutions in 3D human pose estimation. The main contents and contributions of this paper are as follows. (1) To address depth ambiguity and the weak feature extraction ability of the traditional Transformer, this paper introduces the concept of multiple hypotheses, modifies the internal structure of the Transformer, and proposes EMHIFormer: An Enhanced Multi-Hypothesis Interaction Transformer for 3D Human Pose Estimation in Video. First, the output of each Transformer layer is treated as a hypothesis, and connections are established between different Transformer layers so that the current layer can integrate the spatial information output by the previous layer and build a more comprehensive hypothesis. Next, the hypothesis interaction refinement module extracts the temporal relationships between frames while fusing the information of the previous hypothesis, which introduces correlation between hypotheses. The cross-hypothesis interaction module then performs information interaction and fusion among hypotheses. Finally, an enhanced regression head adaptively adjusts the channel weights to derive the final 3D human pose. (2) Building on contribution (1), and to address the problem that most current algorithms isolate the intrinsic connection between spatial and temporal information, this paper changes how the spatiotemporal Transformers are combined and proposes MHPFormer: A Multi-Hypothesis Parallel Transformer for 3D Human Pose Estimation in Video. First, the feature and processing module extracts the underlying spatiotemporal features. Next, the spatial-spatiotemporal parallel Transformer structure further enhances the spatiotemporal feature extraction capability of the backbone network. The upper and lower layers of each parallel Transformer are then treated as two sub-hypotheses, and the feature information of the two sub-hypotheses is fused by concatenation followed by a fully connected operation, yielding an output rich in hypothesis information. The hypotheses are then fused by a lightweight hypothesis interaction module consisting of a set of learnable parameters, which further strengthens the correlation between hypotheses. Finally, to alleviate the long training time caused by the parallel Transformer, the model input and output are changed: the traditional mode of multi-frame input with middle-frame output is replaced by multi-frame input with multi-frame output, which reduces redundant computation and significantly shortens training time. |
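The depth ambiguity mentioned above can be illustrated with a minimal sketch. It assumes an idealised pinhole camera with unit focal length, and the joint coordinates are hypothetical values chosen for illustration; none of this is taken from the proposed models. Two different 3D joint positions on the same viewing ray project to the same 2D keypoint, so the 2D observation alone cannot tell them apart.

```python
# Minimal pinhole-camera sketch of depth ambiguity (illustrative assumptions only).

def project(point3d, focal=1.0):
    """Perspective projection of a camera-space point (x, y, z) onto the image plane."""
    x, y, z = point3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal * x / z, focal * y / z)


def scale_along_ray(point3d, s):
    """Move a point along its viewing ray by scaling all three coordinates."""
    return tuple(s * c for c in point3d)


# A hypothetical joint position in camera coordinates (metres).
wrist_near = (0.5, -0.25, 2.0)
# A different 3D position on the same ray, twice as far from the camera.
wrist_far = scale_along_ray(wrist_near, 2.0)

uv_near = project(wrist_near)
uv_far = project(wrist_far)

# Different 3D points, identical 2D keypoint.
print(wrist_near, "->", uv_near)
print(wrist_far, "->", uv_far)
```

Generating several candidate 3D poses (hypotheses) for one 2D input is one way to cope with this one-to-many mapping, which is the motivation given for the multi-hypothesis design.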
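The redundancy argument behind the switch from middle-frame output to multi-frame output can be sketched by counting window passes. The counting scheme is an assumption for illustration: the middle-frame mode is modelled as a stride-1 sliding window with padding at the clip boundaries, the multi-frame mode as non-overlapping windows, and the frame count and window length are hypothetical rather than the paper's settings.

```python
import math


def passes_middle_frame(num_frames: int) -> int:
    """Window passes needed when each pass predicts only the centre frame.

    Assumes a stride-1 sliding window with padding at the clip boundaries,
    so every frame needs its own pass.
    """
    return num_frames


def passes_multi_frame(num_frames: int, window: int) -> int:
    """Window passes needed when each pass predicts every frame in its window.

    Assumes non-overlapping windows; the last window may be partially filled.
    """
    return math.ceil(num_frames / window)


# Hypothetical clip length and window size, chosen only for illustration.
num_frames, window = 2700, 27

single = passes_middle_frame(num_frames)
multi = passes_multi_frame(num_frames, window)
print(f"middle-frame output: {single} passes")
print(f"multi-frame output:  {multi} passes")
```

Under these assumptions the number of passes drops by roughly a factor equal to the window length, because neighbouring windows no longer recompute almost the same frames.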