Computer vision has made remarkable progress in recent years, and there is a pressing need to understand human-centered visual information in military, daily-life, and educational settings. Capturing people in video and predicting their behavior is an important part of video understanding. Supervised deep learning has achieved strong results; however, its demand for labeled data and expensive computing power limits wide application, and current algorithms' insufficient modeling of human body structure and temporal dependence further reduces performance. Therefore, fully exploiting human body structure and temporal information to improve performance and generalization, together with a real-time, energy-efficient hardware system, is of great significance for promoting the development and wide application of vision systems. This thesis takes human video as the carrier, focuses on human body structural information and video temporal information, and studies human behavior prediction and video generation tasks. For monocular human video with only a small amount of video-level supervision, a reconstruction model that performs well on natural-scene videos is obtained by exploiting image-level supervision and the structural information of the human body model. Taking human action recognition as an example, the problem of domain generalization in multi-source domain settings is studied. For the Graph U-Net based 3D human reconstruction algorithm, an FPGA-based high-efficiency system is designed to achieve real-time, low-energy human body reconstruction. The main work of this thesis is as follows:

2D pose-forecasting aided human video prediction. Due to the unknown nature of future actions and the complexity of video details, human video prediction remains a very challenging problem. Recent methods tackle this problem in two steps: first to forecast future human poses from the initial ones, and then
to generate realistic frames conditioned on the predicted poses. Following this framework, we propose a novel Graph Convolutional Network (GCN) based pose predictor that comprehensively models human body joints and forecasts their positions holistically, together with a stacked generative model and a temporal discriminator that iteratively refine the quality of the generated videos. The GCN-based pose predictor fully considers the relationships among body joints and produces more plausible pose predictions. Guided by the predicted poses, the temporal discriminator encodes temporal information into future frame generation to achieve high-quality results, and the stacked residual refinement generators make the results more realistic. Extensive experiments on benchmark datasets demonstrate that the proposed method produces better predictions than state-of-the-art methods and achieves up to 15% improvement in PSNR.

Human body reconstruction for 3D pose from monocular video. Estimating the shape and pose of the complete 3D human body from monocular video is a challenging problem. Since real-world 3D mesh-labeled datasets are limited, most current methods for 3D human shape reconstruction focus on single RGB images, losing all temporal information. In contrast, we propose temporally refined Graph U-Nets, comprising a Graph U-Nets based image-level module and a Residual Temporal Graph CNN (Residual TG-CNN) based video-level module, to solve this problem. The image-level module regresses human shape and pose from images, where the Graph Convolutional Neural Network (Graph CNN) supports information exchange between neighboring vertices, and the U-Nets architecture enlarges the receptive field of each vertex and fuses high-level and low-level features. The video-level module learns temporal dynamics from both structural and temporal neighbors. The temporal dynamics of each vertex are continuous in the temporal dimension and highly correlated with its structural neighbors, so it is helpful
to diminish the ambiguity of the body in single images by fusing temporal dynamics. Our algorithm makes full use of labels from image-level datasets and refines the image-level results through the video-level module. Evaluated on the Human3.6M and 3DPW datasets, our model produces accurate 3D human meshes and achieves superior 3D human pose estimation accuracy compared with state-of-the-art methods.

Multi-source domain generalization via a Mahalanobis distance based classifier. Multi-source domain generalization (MSDG) is a topical problem that targets learning a model from multiple source domains that performs well on an unseen target domain. Existing methods simply model the decision boundaries with a linear classifier on the input features. This can be a problem for DG, as a learned linear decision boundary may overfit to the source domains without considering class variations. Alternatively, using Bayes' rule, we propose a Mahalanobis distance based classifier, parameterized by class-wise means and covariances, to model the conditional distribution of features given each class, leading to quadratic decision boundaries. First, we can directly encourage the learned distribution of each class to be compact, i.e., to have low intra-class variation. Second, we can optimize the model's discrimination by maximizing the separability of each class's associated feature distribution. To efficiently estimate the mean and covariance of each class during optimization, we maintain a weighted average of the mean and covariance from each mini-batch using an exponential moving average. Our method is model-agnostic and applies to any base DG method. We incorporate our method with ERM and CORAL and demonstrate new state-of-the-art performance on three popular DG benchmarks: Rotated MNIST, PACS, and VLCS.

FPGA based energy-efficient system for 3D human reconstruction. 3D human reconstruction is a fundamental task for human-related applications, where Graph Convolution Networks (GCNs) have shown promising
performance. Wide deployment remains challenging because 1) expensive GPUs with high energy consumption are required, 2) GCNs involve irregular data communication and the human-model graph differs from general graphs, and 3) feature transformation demands intensive computation. To address these challenges, we propose an FPGA-based energy-efficient system for 3D human reconstruction with GCNs. Specifically, first, we make the model lightweight by focusing on the walking scenario and pruning it with a hybrid method. Second, considering the characteristics of the human-body adjacency graph, we design a Basic Inference Module (BIM) that formulates both the aggregation phase and the combination phase, eliminating part of the adjacency-graph computation and unifying the hardware structure. Third, to make full use of on-chip resources, a dynamic programming algorithm based on resources and performance is proposed to guide resource allocation and minimize the number of pipeline cycles.
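As a toy illustration of the third point only, the resource-allocation step can be sketched as a dynamic program that splits a DSP budget across pipeline stages so that the slowest stage (the pipeline bottleneck) needs as few cycles as possible. All workloads, budgets, and the cost model `ceil(work / dsps)` here are hypothetical simplifications, not the actual formulation used in the thesis:

```python
import math

def allocate_dsps(workloads, total_dsps):
    """Toy DP: split `total_dsps` across pipeline stages so the slowest
    stage's cycle count (assumed to be ceil(workload / dsps)) is minimal.
    dp[i][r] = best achievable bottleneck using the first i stages and r DSPs.
    """
    n = len(workloads)
    INF = float("inf")
    dp = [[INF] * (total_dsps + 1) for _ in range(n + 1)]
    pick = [[0] * (total_dsps + 1) for _ in range(n + 1)]
    for r in range(total_dsps + 1):
        dp[0][r] = 0  # no stages left: nothing to compute
    for i in range(1, n + 1):
        w = workloads[i - 1]
        for r in range(1, total_dsps + 1):
            for k in range(1, r + 1):  # give k DSPs to stage i
                cand = max(dp[i - 1][r - k], math.ceil(w / k))
                if cand < dp[i][r]:
                    dp[i][r] = cand
                    pick[i][r] = k
    # Backtrack the per-stage allocation.
    alloc, r = [], total_dsps
    for i in range(n, 0, -1):
        k = pick[i][r]
        alloc.append(k)
        r -= k
    alloc.reverse()
    return dp[n][total_dsps], alloc

# Hypothetical per-stage MAC workloads and an 8-DSP budget:
bottleneck, alloc = allocate_dsps([100, 200, 100], 8)
print(bottleneck, alloc)  # the heavy middle stage gets the most DSPs
```

The min-max structure (minimize the maximum stage latency) is what makes a greedy split insufficient and a DP natural; a real design would additionally account for BRAM, routing, and the BIM's fixed structure.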