Audio-driven talking face generation aims to synthesize a speech video of a target character from an arbitrary voice and a face image of that character. The technology can be applied in game production, virtual anchors, online education, film editing, and other scenarios. However, due to the lack of high-resolution audio-visual datasets, current methods fail to synthesize high-resolution talking face videos and do not reproduce the fine details of lip motion synchronized with arbitrary speech. Moreover, existing methods focus on face synthesis quality and lip synchronization while ignoring the naturalness of head movements, so the head pose remains fixed in the generated video. This paper studies talking face generation technology; our work and contributions are as follows.

Firstly, a high-definition talking face generation algorithm based on lip synchronization is proposed to generate high-resolution talking face videos whose lip movements are synchronized with the input audio. The method introduces an HD face generation network with a face reconstruction module: the network generates speaker face images corresponding to the audio, and the face reconstruction module performs super-resolution reconstruction of the generated images to recover facial details and skin texture. In addition, a lip-synchronization discriminator with a dual attention mechanism is added to accurately judge the degree of synchronization between the mouth motion in the generated image and the audio, and the trained discriminator supervises the generation network to produce accurate lip motion. Experiments show that this method can effectively generate high-definition talking face videos with synchronized lips.

Secondly, a talking face generation algorithm based on a 3D face model is proposed, which generates talking face videos with natural head motion. This method uses a 3D face model to reconstruct 3D faces from images, and represents face shape, expression, pose, and other
information as parameters. We propose a speech feature extraction network with a delayed long short-term memory (LSTM) network that predicts expression and pose parameters from audio, and combines them with the 3D face parameters to generate new 3D face parameters. The rendered 3D face is fused with the background and fed into a face image generation network, and a discriminator compares the output face image with real images to produce high-quality talking face frames. Experiments show that, given different input audio, the method can generate natural speech videos with head movement.

Thirdly, a high-definition Chinese video dataset is built, containing 4,600 video segments of five news anchors with a total length of about 530 minutes. The dataset is selected and cut from public high-resolution news videos, and is characterized by low background noise, standard anchor pronunciation, and correct posture, making it well suited to the talking face generation task.
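The dual-attention lip-synchronization discriminator in the first contribution can be sketched as a SyncNet-style two-stream network that embeds a mouth crop and an audio window and scores their agreement. The attention design (channel plus spatial, CBAM-style), layer sizes, and input shapes below are illustrative assumptions, not the thesis's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    """Channel + spatial attention; one plausible reading of the
    'dual attention mechanism' named in the abstract (an assumption)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # Channel attention weights from globally pooled features
        w = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3))))
        x = x * w[:, :, None, None]
        # Spatial attention from channel-wise mean and max maps
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        return x * torch.sigmoid(self.spatial_conv(s))

class LipSyncDiscriminator(nn.Module):
    """Two-stream discriminator: embeds a mouth-crop image and a
    mel-spectrogram window, then scores synchronization by cosine
    similarity. The score can supervise the generator's lip motion."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.face_net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            DualAttention(64),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim))
        self.audio_net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            DualAttention(64),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim))

    def forward(self, mouth, mel):
        f = F.normalize(self.face_net(mouth), dim=1)
        a = F.normalize(self.audio_net(mel), dim=1)
        # Map cosine similarity [-1, 1] to a sync probability in [0, 1]
        return (F.cosine_similarity(f, a) + 1) / 2

disc = LipSyncDiscriminator()
mouth = torch.randn(2, 3, 96, 96)   # batch of mouth crops
mel = torch.randn(2, 1, 80, 16)     # batch of mel-spectrogram windows
score = disc(mouth, mel)            # one sync score per pair, in [0, 1]
```

In this reading, the generator is penalized when the discriminator assigns its frames a low sync score, which is the standard way a pretrained lip-sync expert supervises lip motion.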
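The delayed LSTM in the second contribution can be sketched as a sequence regressor whose output for frame t is read d steps late, so the network sees a short window of future audio before committing to that frame's 3D parameters. The feature and parameter dimensions below (28-d audio features, 64-d expression, 6-d pose, delay of 5 frames) are illustrative assumptions, not the thesis's exact configuration:

```python
import torch
import torch.nn as nn

class DelayedLSTMRegressor(nn.Module):
    """Predicts per-frame 3D face model expression and pose parameters
    from audio features, with a fixed output delay for future context."""
    def __init__(self, audio_dim=28, hidden=256, exp_dim=64, pose_dim=6, delay=5):
        super().__init__()
        self.delay = delay
        self.exp_dim = exp_dim
        self.lstm = nn.LSTM(audio_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, exp_dim + pose_dim)

    def forward(self, audio_feats):
        # audio_feats: (B, T, audio_dim). Pad with `delay` zero-frames at
        # the end so frame t's parameters are emitted at step t + delay,
        # after the LSTM has consumed `delay` frames of future audio.
        B, T, D = audio_feats.shape
        pad = audio_feats.new_zeros(B, self.delay, D)
        h, _ = self.lstm(torch.cat([audio_feats, pad], dim=1))
        out = self.head(h[:, self.delay:, :])        # (B, T, exp+pose)
        return out[..., :self.exp_dim], out[..., self.exp_dim:]

model = DelayedLSTMRegressor()
audio = torch.randn(1, 100, 28)            # 100 frames of MFCC-like features
exp_params, pose_params = model(audio)     # per-frame expression and pose
```

The predicted expression and pose parameters would then replace the corresponding coefficients of the reconstructed 3D face before rendering, which is how the method animates the face while keeping the identity (shape) parameters fixed.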