
Top-down Human Pose Estimation Based On Deep Learning

Posted on: 2024-06-02    Degree: Master    Type: Thesis
Country: China    Candidate: H Dong    Full Text: PDF
GTID: 2568307148462914    Subject: Computer Science and Technology
Abstract/Summary:
The human pose estimation task aims to locate the joint points of the human body in images or videos. Human pose estimation algorithms based on deep learning currently achieve high recognition accuracy and fast inference and have become the mainstream in this field, so the development of pose estimation algorithms is closely tied to the development of deep learning technology. The introduction of the Transformer has had a significant impact on computer vision, including human pose estimation. Convolutional neural networks (CNNs) and Transformers are feature extractors with different operating principles, each with its own characteristics and strengths. How to fully exploit the strengths of both extractors to better serve computer vision tasks is a question that many researchers are exploring.

This thesis takes top-down human pose estimation based on deep learning as its starting point and investigates in depth how fused CNN-Transformer frameworks can be applied to human pose estimation. To integrate the CNN and Transformer frameworks into high-performance pose estimators and to remedy shortcomings of the Transformer, the thesis proposes the following three solutions:

(1) A human pose estimation network based on an aggregation Transformer and keypoint purification. To make full use of the local feature extraction ability of the CNN and the global feature extraction ability of the Transformer, we combine the two in series: a ResNet first extracts local features, and an Aggregation Transformer, a Transformer variant we design for the pose estimation task, then extracts global features. A local fusion module and a keypoint head are embedded in its decoder to further extract local features and refine the keypoint coordinates (see the first sketch after this list).

(2) A human pose estimation network based on a parallel architecture and hybrid features. In contrast to the serial combination above, this network arranges the CNN and the Transformer in parallel. Building on the Inception structure, we introduce an attention branch to extract global features while retaining the remaining branches to extract local features; the feature maps produced by all branches are then concatenated and mixed before being passed to subsequent modules for further recognition. In addition, we adopt the SimDR representation to predict keypoint coordinates and use a KL-divergence loss to optimize the network parameters (see the second sketch after this list).

(3) A human pose estimation network based on ViTPose and a progressive sampling strategy. When the Patch Embedding module of the original Transformer serializes the feature map, its down-sampling factor is too large and feature information is lost. We therefore design a Gradual Embedding module to replace Patch Embedding; it applies a progressive sampling strategy that reduces the feature map size step by step, alleviating the loss of feature information. We also design a local fusion module, likewise based on the progressive strategy, to replace the original transposed convolution: bilinear interpolation and max unpooling layers are combined to recover a more refined heatmap (see the third sketch after this list).
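A minimal sketch of the serial CNN-to-Transformer design in contribution (1), assuming a ResNet-50 backbone, a standard Transformer encoder standing in for the thesis's Aggregation Transformer, and a simple deconvolution head; the module names, widths, and head structure are illustrative assumptions, not the thesis implementation.

```python
# Serial design: CNN extracts local features, Transformer models global context,
# and a head regresses one heatmap per keypoint. Sizes are illustrative only.
import torch
import torch.nn as nn
import torchvision

class SerialCnnTransformerPose(nn.Module):
    def __init__(self, num_keypoints=17, d_model=256, num_layers=4):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # Keep everything up to the last residual stage (stride 32, 2048 channels).
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)  # reduce channels before tokenizing
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Hypothetical keypoint head: upsample the token map back to per-joint heatmaps.
        self.head = nn.Sequential(
            nn.ConvTranspose2d(d_model, d_model, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(d_model, num_keypoints, kernel_size=1),
        )

    def forward(self, x):
        feat = self.proj(self.cnn(x))              # (B, d_model, H/32, W/32) local features
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)   # (B, H*W, d_model) token sequence
        tokens = self.transformer(tokens)          # global feature mixing
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.head(feat)                     # (B, num_keypoints, H/16, W/16) heatmaps

heatmaps = SerialCnnTransformerPose()(torch.randn(1, 3, 256, 192))
```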
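A rough sketch of the parallel hybrid block in contribution (2): Inception-style convolution branches extract local features while a multi-head self-attention branch extracts global features, and the branch outputs are concatenated and fused. Branch widths and kernel sizes are placeholder assumptions, not the thesis configuration.

```python
# Parallel design: local convolution branches and a global attention branch
# run side by side on the same input and their outputs are mixed by a 1x1 conv.
import torch
import torch.nn as nn

class ParallelHybridBlock(nn.Module):
    def __init__(self, channels=256, heads=4):
        super().__init__()
        branch_ch = channels // 4
        self.branch1 = nn.Conv2d(channels, branch_ch, kernel_size=1)
        self.branch3 = nn.Conv2d(channels, branch_ch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, branch_ch, kernel_size=5, padding=2)
        # Attention branch: spatial positions become tokens for self-attention.
        self.attn_proj = nn.Conv2d(channels, branch_ch, kernel_size=1)
        self.attn = nn.MultiheadAttention(branch_ch, heads, batch_first=True)
        self.fuse = nn.Conv2d(branch_ch * 4, channels, kernel_size=1)

    def forward(self, x):
        b, _, h, w = x.shape
        local = [self.branch1(x), self.branch3(x), self.branch5(x)]
        tokens = self.attn_proj(x).flatten(2).transpose(1, 2)   # (B, H*W, C/4)
        glob, _ = self.attn(tokens, tokens, tokens)             # global features
        glob = glob.transpose(1, 2).reshape(b, -1, h, w)
        return self.fuse(torch.cat(local + [glob], dim=1))      # mix local + global

out = ParallelHybridBlock()(torch.randn(1, 256, 16, 12))
```

SimDR-style training, not shown here, would additionally predict 1-D coordinate distributions along the x and y axes for each joint and optimize them against softened ground-truth distributions with a KL-divergence loss.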
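A sketch of the Gradual Embedding idea in contribution (3), assuming four stride-2 convolution stages that reach the same 1/16 resolution as a single 16x16 patch embedding; the channel widths and number of stages are assumptions, and the progressive local fusion module on the decoder side (bilinear interpolation plus max unpooling) is not shown.

```python
# Gradual Embedding: replace one large-stride patchify step with a stack of
# stride-2 convolutions that downsample the feature map progressively.
import torch
import torch.nn as nn

class GradualEmbedding(nn.Module):
    """Progressively downsample to the same 1/16 resolution as a 16x16 patch embed."""
    def __init__(self, in_ch=3, embed_dim=768):
        super().__init__()
        dims = [64, 128, 256, embed_dim]          # assumed stage widths
        layers, prev = [], in_ch
        for dim in dims:
            layers += [nn.Conv2d(prev, dim, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(dim),
                       nn.GELU()]
            prev = dim
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        feat = self.stem(x)                       # (B, embed_dim, H/16, W/16)
        return feat.flatten(2).transpose(1, 2)    # (B, N, embed_dim) token sequence

tokens = GradualEmbedding()(torch.randn(1, 3, 256, 192))   # -> (1, 192, 768)
```

For comparison, the original ViT-style Patch Embedding performs the same 1/16 reduction in a single stride-16 convolution, which is the step the gradual variant is intended to replace.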
Keywords/Search Tags:Human pose estimation, Convolutional neural network, Transformer, Local features, Global features