With the development of deep learning and the rise of the virtual human industry, computer-based talking face synthesis has become an important research area in artificial intelligence. Applications such as virtual customer service, virtual teachers, virtual broadcasting, and realistic digital twins require virtual characters to present natural, smooth facial behavior during interaction, especially lip animation that matches the speech content. In deep learning research, talking faces are typically reconstructed by using artificial neural networks to learn an embedding between audio or text features and facial features. In recent years, image generation based on the Generative Adversarial Network (GAN) has become the mainstream approach to talking face generation and lip animation generation. By modifying the structure of the GAN's generator and discriminator, this paper improves the quality of the generated video frames, their similarity to the target video, and the audio-visual consistency in the lip animation generation task. In addition, this paper extends the audio-driven method to a text-driven one, building a lip animation generation model driven by text. The main contributions are as follows:

(1) To improve the fluency of the generated video, this paper proposes an auto-encoding video frame generator based on a UNet + ConvGRU hybrid model, which improves the naturalness of transitions between generated frames. Drawing on the dual-encoder structure of CGAN and the deep convolutional design of DCGAN, and combining them with a UNet-style auto-encoder, this paper improves the GAN generator. LSTM and GRU layers are added to the basic fully convolutional audio encoder, and the UNet + ConvGRU hybrid generator is adopted in the final model. Compared with the baseline model, LSE-D decreases from 13.196 to 13.147, LSE-C increases from 0.848 to 1.251, and FID decreases from 2.301 to 2.175.

(2) To address the blurred lip region in reconstructed facial video, this paper proposes a dual visual quality discriminator with a Global-Local structure, which improves the clarity of the lip region and the overall quality of the generated frames. Considering both the local continuity features and the global composition features of the face, a local discriminator is added to the model: the GAN's global discriminator, together with our improved Markovian local discriminator, constitutes the dual visual quality discriminator with the Global-Local structure. Compared with the baseline model, CPBD increases from 0.194 to 0.204, PSNR increases from 22.801 to 24.012, and SSIM increases from 0.987 to 0.995.

(3) To overcome the limitation of the input modality, this paper extends the audio-driven method to a text-driven lip animation generation method, meeting the practical need to generate talking faces from text input. The model is evaluated both quantitatively and qualitatively. For both audio and text input, comparison experiments against Speech2Vid and Wav2Lip show that our model achieves the best LSE-D, LSE-C, and FID scores, and the qualitative evaluation shows that our model obtains high scores for visual quality, audio-visual synchronization accuracy, and overall perception, as well as the highest model preference. In summary, the proposed model achieves high-quality text-driven lip animation generation.
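The ConvGRU in the hybrid generator applies the standard GRU gating at every spatial location, with convolutions in place of matrix products. As a minimal illustrative sketch (not the thesis's implementation), the scalar form of one GRU step, with hypothetical single-channel weights:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    """One GRU update for scalar input x and hidden state h.

    In a ConvGRU the same gating is applied per spatial location,
    with each product w[...] * v replaced by a convolution over
    feature maps; scalars are used here only for clarity.
    """
    z = sigmoid(w["wz"] * x + w["uz"] * h + w["bz"])                # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h + w["br"])                # reset gate
    h_tilde = math.tanh(w["wh"] * x + w["uh"] * (r * h) + w["bh"])  # candidate state
    return (1.0 - z) * h + z * h_tilde                              # blended new state

# Hypothetical weights; a trained model would learn these.
w = {"wz": 0.5, "uz": 0.5, "bz": 0.0,
     "wr": 0.5, "ur": 0.5, "br": 0.0,
     "wh": 1.0, "uh": 1.0, "bh": 0.0}

h = 0.0
for x in [1.0, -0.5, 0.25]:  # a short sequence of per-frame features
    h = gru_step(x, h, w)
```

The hidden state carried across steps is what lets consecutive generated frames share context, which is the mechanism behind the smoother frame transitions reported above.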
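PSNR, one of the reported quality metrics, compares a generated frame with the target frame through mean squared pixel error. A minimal sketch for 8-bit grayscale frames given as flat pixel lists (a simplification; the thesis's actual evaluation pipeline is not shown here):

```python
import math

def psnr(ref, gen, peak=255.0):
    """Peak signal-to-noise ratio between two equal-size 8-bit frames.

    Higher is better; identical frames give infinity.
    """
    mse = sum((a - b) ** 2 for a, b in zip(ref, gen)) / len(ref)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak ** 2 / mse)

# Two hypothetical 2x2 frames differing slightly in two pixels.
ref = [120, 121, 80, 81]
gen = [118, 123, 80, 81]
```

Because the scale is logarithmic, the reported gain from 22.801 to 24.012 dB corresponds to a roughly 25% reduction in mean squared error between generated and target frames.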