
Research On Visual Cross-modal Learning For Digital Human Interaction

Posted on: 2024-04-05    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Wang    Full Text: PDF
GTID: 1528307352485014    Subject: Computer Science and Technology
Abstract/Summary:
Digital human interaction technology has great application potential and broad market prospects in many fields, such as virtual reality and the smart home. A central issue in digital human interaction research is how to achieve highly natural and efficient interaction between digital humans and people through cross-modal learning from multiple data sources, such as images, speech, and actions. This involves authenticating a person's identity during interaction, generating natural interactive talking portraits, and recognizing saliency in the interaction scene. This dissertation focuses on vision-based cross-modal learning problems in digital human interaction applications and conducts in-depth research on three specific issues: face liveness detection using both depth and image information, speech-driven talking-head video generation, and saliency detection and gaze estimation in interactive scenes that account for color factors. Building on deep neural network models and cross-modal learning approaches, the dissertation aims to improve the experience and performance of digital human interaction applications. The main contributions and innovations are as follows:

First, this dissertation proposes a cross-modal face liveness detection model that organically integrates depth and image information. A convolutional neural network is designed to extract features from depth maps and two-dimensional images, and a multimodal fusion and decision network is proposed to combine the two modalities for liveness detection and judgment. In addition, a multimodal face liveness detection dataset is built with a Kinect sensor and preprocessed for data augmentation and cross-modal data alignment. Experimental results show that the proposed method achieves higher accuracy and stronger robustness than existing methods.

Second, this dissertation proposes a new speech-driven talking-head video generation method. The method uses a deep network model to learn the mapping between speech and visual images, and an attention-based generative adversarial network to generate realistic speaker videos. A lip-sync discriminator built on a dual attention mechanism and a high-definition image generation module are introduced into the model, effectively addressing the insufficient naturalness of motion and the low resolution of generated frames. Experimental results show that the proposed method achieves excellent results on multiple datasets and can generate high-quality digital human speech videos.

Third, this dissertation proposes a novel saliency detection model and a novel fixation-point estimation model that fuse color prior information. Based on data collected with an eye tracker, a color–grayscale contrast dataset is constructed, and the independent contribution of color to visual saliency is analyzed on it. The proposed fusion models effectively incorporate color-perception priors to achieve more accurate salient-object recognition and gaze-point estimation in interactive scenes. Experimental results verify that the proposed method achieves higher accuracy and stronger robustness than existing methods.
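The abstract does not reproduce the dissertation's network architectures. As a minimal illustration of the score-level fusion idea behind the multimodal liveness decision described above, the sketch below combines per-modality confidence scores from hypothetical RGB and depth branches into one liveness verdict; the weights, the sharpening constant, and the function names are all assumptions for illustration, not the dissertation's actual fusion and decision network.

```python
import math

def fuse_liveness_scores(rgb_score: float, depth_score: float,
                         w_rgb: float = 0.6, w_depth: float = 0.4) -> float:
    """Fuse per-modality liveness scores (each in [0, 1]) into one score.

    A real system would fuse learned CNN features; here we take a weighted
    combination of branch confidences and sharpen it with a logistic curve
    centered at 0.5, so the fused score stays in (0, 1).
    """
    z = w_rgb * rgb_score + w_depth * depth_score
    return 1.0 / (1.0 + math.exp(-8.0 * (z - 0.5)))

def is_live(rgb_score: float, depth_score: float,
            threshold: float = 0.5) -> bool:
    """Decide live vs. spoof by thresholding the fused score."""
    return fuse_liveness_scores(rgb_score, depth_score) >= threshold
```

In practice the depth branch helps reject flat spoofs (printed photos, screens) that can fool an RGB-only classifier, which is the motivation for fusing the two modalities rather than relying on either alone.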
Keywords/Search Tags:Digital Human Interaction, Cross-Modal Learning, Face Liveness Detection, Speaking Video Generation, Saliency Detection, Gaze Estimation