Talking face generation aims to synthesize a talking video that is synchronized with a given speech fragment and facial image. The technology is widely used in human-computer interaction to enhance the user experience: digital humans can communicate "face to face" with users as online customer service agents, virtual lecturers, and virtual anchors. It can also support assistive devices for hearing-impaired people, helping them understand information by converting audio into visual form. Improving the user experience requires further improving the realism of the generated talking face; a realistic talking face involves not only accurate lip movements but also fine-grained facial movements and head movements.

First, the correlation between speech and fine facial movements is weak, so existing methods that map speech directly to facial motion cannot generate detailed movements in regions of the face beyond the mouth. To strengthen the connection between speech and facial detail movements, this thesis proposes a method grounded in facial anatomy that introduces facial action units. A new two-stage model is constructed: the first stage maps speech to action units, and the second stage uses the predicted action units to control face generation, so that speech drives changes in the action units that in turn produce the talking face. Experimental results show that, compared with state-of-the-art methods, the proposed method generates more realistic talking face videos for arbitrary faces, with richer facial motion details such as cheek and eyebrow movements.

Second, existing talking face generation methods with head movement rely on two kinds of explicit modeling, facial landmarks and 3D face models, both of which cause information loss and motion distortion. This thesis therefore proposes a method based on implicitly learned identity and content representations, which avoids the deformation problems of explicit modeling by extracting identity and content information from both the image and the speech to express facial and head movements. Specifically, the identity information of the image preserves facial details, the content information of the image represents facial changes during talking, the identity information of the speech captures personalized motion, and the content information of the speech drives synchronized facial movements. By disentangling and recombining the information in the speech and the image, the method relates facial motion and head motion to the speech, and a mask is used to express finer detail and motion information, yielding talking faces with accurate facial motion and natural head motion. Experimental results show that the talking faces with head movement generated by this method exhibit more facial details and more accurate facial and head motion.
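To make the two-stage design concrete, the following is a minimal PyTorch-style sketch, not the thesis's actual architecture: an audio encoder predicts action unit intensities from a window of speech features, and a conditional generator renders the reference face under those action units. All module names, dimensions, and the choice of 17 action units are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeechToAU(nn.Module):
    """Stage 1 (sketch): map a window of audio features to action unit intensities."""
    def __init__(self, audio_dim=80, hidden=256, num_aus=17):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        # Sigmoid keeps each predicted AU intensity in [0, 1].
        self.head = nn.Sequential(nn.Linear(hidden, num_aus), nn.Sigmoid())

    def forward(self, audio_feats):            # (B, T, audio_dim), e.g. mel frames
        out, _ = self.rnn(audio_feats)
        return self.head(out)                  # (B, T, num_aus)

class AUConditionedGenerator(nn.Module):
    """Stage 2 (sketch): generate a face frame from a reference image and target AUs."""
    def __init__(self, num_aus=17):
        super().__init__()
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.au_fc = nn.Linear(num_aus, 128)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, ref_img, aus):           # ref_img: (B, 3, H, W), aus: (B, num_aus)
        feat = self.img_enc(ref_img)           # (B, 128, H/4, W/4)
        # Broadcast the AU code over the spatial grid and fuse it with the image features.
        au_map = self.au_fc(aus)[:, :, None, None].expand(-1, -1, feat.size(2), feat.size(3))
        return self.dec(torch.cat([feat, au_map], dim=1))  # (B, 3, H, W)

# Toy usage: one second of audio at 25 fps paired with a 64x64 reference face.
audio = torch.randn(2, 25, 80)
ref = torch.randn(2, 3, 64, 64)
aus = SpeechToAU()(audio)                         # (2, 25, 17)
frame = AUConditionedGenerator()(ref, aus[:, 0])  # first frame, (2, 3, 64, 64)
```

Factoring the problem this way means lip-sync accuracy is learned in the low-dimensional AU space, while photorealism is handled separately by the image generator.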
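Similarly, a minimal sketch of the implicit identity/content disentanglement, assuming four independent encoders over precomputed image and audio features and a simple fused decoder; all names and dimensions are hypothetical, and the thesis's masking and fusion scheme is not reproduced here.

```python
import torch
import torch.nn as nn

class Disentangler(nn.Module):
    """Sketch: split image and speech into identity and content codes, then fuse to render a frame."""
    def __init__(self, img_dim=512, aud_dim=80, code=128):
        super().__init__()
        self.img_id = nn.Linear(img_dim, code)   # image identity: preserves facial detail
        self.img_ct = nn.Linear(img_dim, code)   # image content: facial changes while talking
        self.aud_id = nn.Linear(aud_dim, code)   # speech identity: personalized motion style
        self.aud_ct = nn.Linear(aud_dim, code)   # speech content: drives synchronized motion
        self.decoder = nn.Sequential(
            nn.Linear(4 * code, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * 64 * 64), nn.Tanh(),
        )

    def forward(self, img_feat, aud_feat):
        # Recombining the four codes is what lets one image animate under any speech,
        # without an explicit landmark or 3D face model in between.
        codes = torch.cat([self.img_id(img_feat), self.img_ct(img_feat),
                           self.aud_id(aud_feat), self.aud_ct(aud_feat)], dim=-1)
        return self.decoder(codes).view(-1, 3, 64, 64)

out = Disentangler()(torch.randn(2, 512), torch.randn(2, 80))  # (2, 3, 64, 64)
```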