| With the rapid development of artificial intelligence, intelligent robots have entered many aspects of daily life, and human-computer interaction has become increasingly important. Hand-based human-computer interaction plays an important role in many settings, such as information input, screen switching, and information acquisition in virtual reality. As the foundation of hand interaction, 3D hand pose estimation therefore has significant research value. In recent years, with the development of deep learning and convolutional neural networks, 3D hand pose estimation from RGB images has attracted extensive attention and achieved good results. However, the task remains challenging because of the lack of depth information, cluttered backgrounds, diverse poses, and the difficulty of obtaining 3D pose annotations, so further in-depth research is needed. The research contents of this paper are as follows: (1) Based on disentangled representation learning theory and the cross-modal Variational Autoencoder (VAE) model, we derive a “Single Input Multiple Output” (SIMO) disentangled model, cm SIMO-β VAE. Guided by this derived model, we design a new VAE network, named da-VAE, for the challenging task of 3D hand pose estimation from a single RGB image. The network uses a variational autoencoder structure to encode the input image, and combines the output of the decoder with an attention module to disentangle the latent space, decomposing it into subspaces that represent the semantic information of hand pose, hand shape, and hand appearance (color/texture), respectively. This yields more effective hand pose information and thus more accurate hand pose estimation. Experimental results on several public datasets show that the information learned in the decomposed subspaces conforms to the given semantic information. Our method can predict relatively accurate 3D hand poses from a
single RGB image, with accuracy comparable to the current state of the art. (2) Insufficient training data with 3D pose annotations is a major factor limiting the performance of 3D hand pose estimation. However, obtaining accurate 3D hand pose annotations is difficult, time-consuming, and costly. This paper proposes a semi-supervised disentanglement network that extracts more effective hand information by disentangling RGB images at the feature level and predicts 3D hand poses from RGB images in a semi-supervised manner. The network exploits the latent characteristics of unannotated RGB data to improve the accuracy of hand pose estimation and reduce the model’s dependence on annotated data. Experimental results show that the proposed method predicts more accurate and plausible 3D hand poses from a single RGB image, demonstrating that augmenting the annotated training dataset with unannotated RGB images can improve the accuracy of hand pose estimation. |
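
The two ideas summarized above, a latent space partitioned into pose, shape, and appearance subspaces, and training that mixes annotated and unannotated images, can be illustrated with a minimal NumPy sketch. This is not the da-VAE implementation: the subspace sizes in `SUBSPACES`, the function names, the squared-error pose term, and the way the terms are combined are all assumptions made for illustration, since the abstract does not specify them. Only the closed-form KL divergence between a diagonal Gaussian and a standard normal is standard.

```python
import numpy as np

# Hypothetical subspace sizes; the abstract does not state latent dimensions.
SUBSPACES = {"pose": 32, "shape": 16, "appearance": 16}


def split_latent(z):
    """Split a latent batch of shape (N, D) into named subspaces."""
    sizes = list(SUBSPACES.values())
    if z.shape[1] != sum(sizes):
        raise ValueError("latent width does not match the subspace sizes")
    bounds = np.cumsum(sizes)[:-1]  # column indices where each subspace ends
    parts = np.split(z, bounds, axis=1)
    return dict(zip(SUBSPACES, parts))


def kl_to_standard_normal(mu, logvar):
    """Per-sample KL(N(mu, diag(exp(logvar))) || N(0, I))."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)


def semi_supervised_loss(recon_err, mu, logvar, pose_pred, pose_gt,
                         has_label, beta=1.0):
    """Illustrative objective (an assumption, not the paper's loss).

    recon_err: (N,) reconstruction error per image
    pose_pred, pose_gt: (N, J, 3) joint coordinates
    has_label: (N,) bool, True where a 3D annotation exists
    """
    # Unsupervised part: applied to every image, annotated or not.
    unsup = recon_err + beta * kl_to_standard_normal(mu, logvar)
    # Supervised part: mean squared joint distance, annotated images only.
    pose_err = np.mean(np.sum((pose_pred - pose_gt) ** 2, axis=2), axis=1)
    sup = np.where(has_label, pose_err, 0.0)
    n_labeled = max(int(has_label.sum()), 1)  # avoid division by zero
    return float(unsup.mean() + sup.sum() / n_labeled)
```

In this sketch, unannotated images contribute only through the reconstruction and KL terms, while annotated images additionally contribute a pose error. Placeholder values in `pose_gt` for unannotated rows are masked out by `has_label`.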