Human-computer interaction refers to the process of information exchange between humans and computers, carried out in a conversational language and through a defined interaction mode to accomplish a given task. Hand pose estimation and gesture recognition technology have broad application prospects in this field. Traditional input devices can no longer meet people's needs for natural and intuitive interaction. As a nimble and effective effector, the hand plays an important role in daily life. Gesture estimation and recognition technology can recognize and understand human hand movements and convert them into computer instructions, thereby achieving natural interaction with the computer. It has been widely applied and developed in fields such as entertainment, consumer goods, smart homes, medical care, industrial design, intelligent driving, and space applications, and has had a profound impact. In recent years, approaches based on deep learning have been able to achieve gesture recognition through ordinary RGB cameras, greatly reducing costs and providing freer and more natural interaction. With the continuous development of deep learning and neural network technology, the accuracy and real-time performance of gesture estimation and recognition are also constantly improving, bringing more possibilities and convenience to human-computer interaction.

To improve the accuracy of hand pose estimation and to address the problems of complex hand segmentation masks and the difficulty of recognizing gestures at different scales, we propose a novel neural network model called CH-HandNet. CH-HandNet consists of three modules: hand segmentation mask, preliminary 2D hand pose estimation, and
hierarchical estimation. The hand segmentation mask module consists of upper and lower branches and uses a hand mask label to guide the learning of hand segmentation. The hierarchical estimation module estimates the poses of the palm and of the individual fingers separately, optimizing the estimation of hand poses at different scales. Hierarchical estimation is the main optimization strategy and is based on hand topology: we first merge the palm and thumb, then merge the remaining fingers, and finally merge these two branches together. This step-by-step hierarchical approach further improves the performance of the model. Experimental results show that the proposed method has significant advantages in hand pose estimation and prediction accuracy, and that it effectively alleviates the problems of complex hand segmentation masks and the difficulty of recognizing gestures at different scales.

To overcome the complexity and limited adaptability of CH-HandNet, as well as the loss of feature information caused by downsampling in hand pose estimation, the low utilization of pose information in gesture recognition applications, and the low accuracy of keypoint localization, we propose a convolutional neural network with a simple framework named the Fishbone Skeleton Convolutional Neural Network (FS-HandNet). The model consists of three parts: the fish head adopts an efficient bidirectional pyramid structure (BiPS) to alleviate the information loss caused by feature downsampling and to improve small-target feature extraction; the fish body uses a high-resolution preservation structure with asymmetric convolution (HRACS) to maintain high resolution and to enhance its feature extraction ability and robustness to image flipping; and the fish tail adopts a simple deconvolution head structure (DcHS). To implement a gesture recognition application based on hand pose information, we use the fishbone skeleton network to predict hand pose information and
recognize multiple gestures using a convex hull algorithm together with the predicted hand pose information. The experimental results show that our method achieves the best performance. By using the efficient BiPS and the asymmetric-convolution HRACS structure, we successfully address the information loss caused by downsampling and the difficulty of small-target feature extraction, thereby improving the model's adaptability and performance. In addition, the model can also be applied to the recognition of multiple hand gestures.

To address the challenges FS-HandNet faces in terms of parameter count, computational complexity, network complexity, the speed-accuracy trade-off, and deployment, we propose a novel method called MSIPA-HandNet. This method uses a Multi-Scale Information Perception structure based on an Attention mechanism (MSIPA) to extract multi-scale information during downsampling while limiting the growth of model parameters. Next, we use upsampling to restore the resolution and locate the keypoints. We then adjust the keypoint positions with the Distribution-Aware coordinate Representation of Keypoints (DARK) algorithm to improve model accuracy, and we propose speed-accuracy trade-off (SAT) metrics to evaluate model performance based on the constructed model and the DARK algorithm. Finally, we use the keypoint positions obtained from hand pose estimation in real-world applications. We conducted experiments on a public hand pose dataset, and the results show that the proposed method outperforms state-of-the-art methods in several respects. The approach not only reduces model complexity but also improves estimation accuracy, enabling various applications on the computer side.

To address the low accuracy of FS-HandNet in gesture recognition tasks, as well as its excessive parameter count and the resulting difficulties in deployment and application, we propose a lightweight gesture
recognition network (LHGR-Net). LHGR-Net consists of three parts: a basic network structure, a multi-scale structure (MSS), and a lightweight attention structure (LAS). The motivation for this design is that MSS and LAS can enhance the network's representational power: MSS captures both global information and local details, while LAS models long-range dependencies and makes the network more attentive to useful contextual information. We combine these structures to exploit their strengths and compensate for their weaknesses, realizing a complete pipeline of the gesture recognition algorithm, model deployment, and application. Experimental results show that, compared with state-of-the-art methods, LHGR-Net achieves higher accuracy and faster inference speed, and it can be successfully deployed on a Raspberry Pi.
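To make the DARK coordinate-adjustment step used in MSIPA-HandNet concrete, the following is a minimal sketch of its core idea: refining the integer argmax of a keypoint heatmap to sub-pixel precision via a second-order Taylor expansion of the log-heatmap. The function name and the Gaussian test grid are illustrative, not part of the proposed models.

```python
import numpy as np

def dark_refine(heatmap):
    """Sub-pixel refinement of a 2D heatmap peak (DARK-style).

    Takes one Newton step on the log-heatmap around the integer
    argmax: offset = -H^{-1} * gradient, where H is the 2x2 Hessian
    estimated by finite differences.
    """
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # Keep away from the border so central differences are valid.
    if not (1 <= x < w - 1 and 1 <= y < h - 1):
        return float(x), float(y)
    p = np.log(np.maximum(heatmap, 1e-10))
    # First derivatives (central differences).
    dx = 0.5 * (p[y, x + 1] - p[y, x - 1])
    dy = 0.5 * (p[y + 1, x] - p[y - 1, x])
    # Second derivatives (Hessian entries).
    dxx = p[y, x + 1] - 2.0 * p[y, x] + p[y, x - 1]
    dyy = p[y + 1, x] - 2.0 * p[y, x] + p[y - 1, x]
    dxy = 0.25 * (p[y + 1, x + 1] - p[y + 1, x - 1]
                  - p[y - 1, x + 1] + p[y - 1, x - 1])
    det = dxx * dyy - dxy * dxy
    if abs(det) < 1e-10:
        return float(x), float(y)
    # Newton step: offset = -H^{-1} * gradient.
    ox = -(dyy * dx - dxy * dy) / det
    oy = -(dxx * dy - dxy * dx) / det
    return float(x + ox), float(y + oy)
```

For a Gaussian-shaped heatmap the log is exactly quadratic, so this refinement recovers the true continuous peak rather than the nearest grid cell, which is what allows DARK to reduce the quantization error of heatmap-based keypoint localization.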