Human-computer interaction (HCI) refers to the ways in which humans and machines communicate through specific channels. Gesture is one of the most natural forms of communication, offering the advantages of being natural, direct, and efficient, and gesture recognition algorithms are the core technology enabling gesture interaction: their goal is to infer the user's intention from captured hand data, which is in great demand in fields such as virtual reality, augmented reality, elderly assistance, and smart homes. With the rapid development of deep learning, the accuracy and stability of gesture recognition algorithms have improved significantly, and researchers now place higher demands on gesture data. However, most traditional gesture datasets consist of images of bare hands, which leads to problems such as poor algorithm robustness, high cost of three-dimensional annotation, and limited model validation in real scenes with dim lighting. There is therefore a growing need for multimodal gesture data generation in complex scenes. This paper studies multimodal gesture data synthesis methods for complex scenes. The main research contents and results are as follows:

(1) A multimodal gesture acquisition platform is built, comprising acquisition hardware and an acquisition program. The platform simultaneously collects RGB images, depth images, and hand IMU data, and uses multi-threading to spread the data-processing load and thereby increase the capture rate. Experimental results show that the collected data achieve high accuracy and good synchronization.

(2) Multi-view synchronous annotation software is developed, which can annotate four images of the same group simultaneously and display the annotation results in real time. Compared with traditional annotation tools, the software better handles the hand self-occlusions that make annotation difficult, and offers intuitive annotation results, convenient adjustment of annotation positions, and high annotation efficiency. Experiments show that the software reduces the errors introduced by the acquisition platform and the labeling process. In addition, it uses an existing 2D key-point pose estimation network to generate 3D key-point data automatically in batches, greatly reducing the labor and time cost of annotation.

(3) Based on an unsupervised generative adversarial network, an image selection module is proposed that adjusts the hand image according to the image background in the generated synthetic dataset, using unpaired data. The module effectively mitigates the brightness mismatch between the hand and the background as well as uneven hand edges, making the synthetic images closer to real images. Experimental results show that this method fuses images better than other mainstream image fusion algorithms.

In summary, for the task of multimodal gesture data generation in complex scenes, this paper builds a multimodal data acquisition platform, develops multi-view synchronous annotation software, and proposes an unsupervised image fusion method; all three are verified by experiments with good results.
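The abstract does not give implementation details for the multi-threaded acquisition program; the pattern it describes (each sensor captured on its own thread, with timestamped frames handed to queues so heavy processing does not block capture) can be sketched as follows, where the sensor loops are simulated stand-ins for the real RGB, depth, and IMU drivers:

```python
import queue
import threading
import time

def sensor_worker(name, period_s, out_q, n_frames):
    """Simulated capture loop for one modality: timestamp each frame
    at capture time and hand it off to a queue, so downstream
    processing runs on other threads and does not slow capture."""
    for i in range(n_frames):
        out_q.put((time.monotonic(), name, i))
        time.sleep(period_s)

def collect(queues, timeout_s=2.0):
    """Take one frame from each sensor queue to form a grouped sample;
    a real system would additionally match frames by nearest
    timestamp to guarantee synchronization."""
    return [q.get(timeout=timeout_s) for q in queues]

# One queue and one capture thread per modality (names are illustrative).
qs = [queue.Queue() for _ in range(3)]
threads = [
    threading.Thread(target=sensor_worker, args=(n, 0.01, q, 5))
    for n, q in zip(["rgb", "depth", "imu"], qs)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

group = collect(qs)  # one timestamped frame from each modality
```

The design choice here is simply to decouple capture from processing: each thread only timestamps and enqueues, which is what lets the platform keep the capture rate high while other threads absorb the processing load.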
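The abstract states that 3D key points are generated automatically from 2D annotations across multiple synchronized views, but does not specify the geometry step. A standard way to obtain a 3D point from its 2D projections in several calibrated cameras is the Direct Linear Transform (DLT); the sketch below assumes known 3x4 projection matrices, which may differ from the thesis's actual pipeline:

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """Triangulate one 3-D point from its 2-D observations in several
    calibrated views via the Direct Linear Transform (DLT).

    proj_mats : list of 3x4 camera projection matrices
    points_2d : list of (u, v) pixel coordinates, one per view
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the
        # homogeneous point X: u*(P[2]@X) = P[0]@X and v*(P[2]@X) = P[1]@X.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector associated with the
    # smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

With four views per group, as the software collects, the same routine simply stacks eight constraint rows instead of four, which also makes the estimate more robust to self-occlusion in any single view.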
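The thesis's brightness adjustment is learned with an unsupervised GAN on unpaired data; the sketch below is not that method, but a crude classical baseline for the same problem (hand/background brightness mismatch in a composite), useful to see what the learned module improves upon. The function name and the simple mean-gain heuristic are both illustrative assumptions:

```python
import numpy as np

def composite_with_brightness_match(hand, mask, background):
    """Paste a hand crop onto a background, rescaling the hand's mean
    brightness toward the background's so the composite looks less
    'pasted on'. A hand-crafted stand-in for the learned, GAN-based
    adjustment described in the abstract.

    hand, background : HxWx3 uint8 images of the same size
    mask             : HxW boolean array, True where the hand is
    """
    hand_f = hand.astype(np.float64)
    bg_f = background.astype(np.float64)
    # Global gain matching the hand's mean brightness to the background's.
    gain = bg_f.mean() / max(hand_f[mask].mean(), 1e-6)
    adjusted = np.clip(hand_f * gain, 0, 255)
    out = bg_f.copy()
    out[mask] = adjusted[mask]
    return out.astype(np.uint8)
```

A global gain cannot fix uneven hand edges or spatially varying lighting, which is precisely where an unsupervised image-to-image network trained on unpaired real scenes has room to do better.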