3D human pose estimation characterizes the human body intuitively and clearly, and is widely used in motion analysis, medical reconstruction, and other fields; it therefore carries significant research value and industrial application potential. Multi-view input effectively supplements information about the human body from multiple viewing angles, mitigating the adverse effects of occlusion, uneven lighting, and other problems caused by a single shooting angle. This thesis focuses on 3D human pose estimation in multi-view scenarios, aiming to further improve the accuracy, robustness, and generalization of multi-view 3D human pose estimation algorithms, and investigates the problem from both the single-person and multi-person perspectives. The main contributions of this thesis are as follows:

For multi-view 3D single-person pose estimation, current methods mine insufficient information from each individual view and fuse the 2D pose estimates of the different views with equal weights, which degrades the final 3D result. To address these problems, this thesis proposes a 3D single-person pose estimation method based on a visual Transformer. A self-attention mechanism with positional embeddings introduces long-range dependence in the feature extraction stage, human structural priors constrain the final result, and in the feature fusion stage the fusion weights are adjusted adaptively according to the quality of the 2D pose features of each view. Results on the public Human3.6M dataset show a 5% improvement on each metric compared with current mainstream methods.

For multi-view 3D multi-person pose estimation, in order to mine multi-view feature information in depth and simplify cross-view matching of joint features, this thesis proposes a 3D multi-person pose estimation method based on cross-view joint encoding. The method attends to features from other views by representing each feature point as a learnable positional embedding that jointly encodes body, joint, and view; the feature-fusion module is implemented with pairwise epipolar geometry and triangulation; and the multi-person pose regression module uses convolutional networks to judge the confidence of joint projections based on the localization and joint-grouping results of the preceding module, while constraining spatial geometric relationships according to the view inputs. Extensive experiments on public datasets show that the proposed method achieves results comparable to those of mainstream methods.
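Both contributions ultimately fuse per-view 2D joint estimates into 3D coordinates, with per-view confidence weighting and triangulation as building blocks. As an illustration only (a minimal sketch of the standard technique the abstract references, not the thesis's actual implementation; the function name and weighting scheme are assumptions), a confidence-weighted direct linear transform (DLT) shows how 2D detections from several calibrated cameras, each with a quality score, can be combined into one 3D joint:

```python
import numpy as np

def weighted_triangulate(points_2d, projections, weights):
    """Triangulate one 3D joint from N calibrated views via weighted DLT.

    points_2d:   (N, 2) pixel coordinates of the joint in each view
    projections: (N, 3, 4) camera projection matrices
    weights:     (N,) per-view confidences (e.g. 2D heatmap scores)
    """
    rows = []
    for (u, v), P, w in zip(points_2d, projections, weights):
        # Each view contributes two rows of the homogeneous system A X = 0,
        # scaled by its confidence so that unreliable views count for less.
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    # Least-squares solution: the right singular vector associated with
    # the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize to 3D coordinates
```

Setting a view's weight near zero effectively removes it from the system, which is the same intuition as adaptively down-weighting low-quality 2D pose features during fusion.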