
Research On 3D Facial Animation Generation Based On Deep Learning

Posted on: 2024-02-06    Degree: Master    Type: Thesis
Country: China    Candidate: X J Ji    Full Text: PDF
GTID: 2568306932961959    Subject: Computer application technology
Abstract/Summary:
The human face is the window through which humans express emotion. Changes in facial expression and muscle movement convey information such as emotions, intentions, and the desire to communicate, all of which play a crucial role in social interaction and interpersonal communication. Simulating realistic facial animation in the virtual world has therefore long been an important research topic in both industry and academia, and it is a significant trend in the development of human-computer interaction technology. Traditional 3D facial animation is usually created either by animators who manually model keyframes or with facial motion-capture technology; the former demands a great deal of time and effort, while the latter requires expensive capture equipment. Inferring facial motion from other input modalities, such as speech and video, and generating the corresponding facial animation therefore has significant research value. This study focuses on generating realistic, lip-synchronized, pose-controllable, and natural-looking 3D facial animation from speech and video inputs. Its contributions can be summarized as follows:

1. The research on speech-driven 3D facial animation generation focuses mainly on lip synchronization. With speech as input, the model is trained on the 4D audio-visual dataset VOCASET. First, to compensate for the sparseness of the 4D data and to improve robustness to noise and outliers as well as cross-lingual generalization, the self-supervised pre-trained model wav2vec 2.0 is used to extract speech features. Second, to better capture the geometric structure of the 3D face model, mesh convolution is introduced; its locally invariant filters are shared across the mesh surface, which significantly reduces the number of network parameters. Finally, an encoder-decoder network built from temporal convolution and mesh convolution is designed to fit the complex mapping between speech features and the 3D face model. Comparative experiments demonstrate the effectiveness of the wav2vec 2.0 speech features and of mesh convolution, and objective and subjective evaluations show that the proposed mesh-convolution-based network outperforms existing state-of-the-art methods while remaining lightweight in its parameter count. Illustrative sketches of the feature-extraction and encoder-decoder steps are given below.
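A minimal sketch of the speech-feature extraction step, using the wav2vec 2.0 implementation in the HuggingFace transformers library. The checkpoint name and the decision to keep the encoder frozen are illustrative assumptions; the abstract states only that a self-supervised pre-trained wav2vec 2.0 model extracts the speech features.

import torch
from transformers import Wav2Vec2Model, Wav2Vec2Processor

# Assumed checkpoint; the thesis does not name the exact pre-trained weights.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
encoder.eval()  # use the pre-trained encoder as a frozen feature extractor

def extract_speech_features(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: mono 16 kHz audio, shape (num_samples,)."""
    inputs = processor(waveform.numpy(), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        # (1, num_frames, 768) contextual features at roughly 50 frames/s
        return encoder(inputs.input_values).last_hidden_state

Because wav2vec 2.0 is pre-trained on large amounts of raw speech, its features are more robust to noise and speaker variation than hand-crafted spectral features, which is what the cross-lingual generalization claim above rests on.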
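The encoder-decoder design could look roughly like the sketch below, which pairs a temporal-convolution encoder over the speech features with a simple graph-style mesh convolution whose single filter is shared across every vertex of the template mesh (the parameter-sharing property mentioned above). The exact mesh-convolution operator, layer sizes, and vertex feature widths are assumptions for illustration, not the thesis architecture.

import torch
import torch.nn as nn

class MeshConv(nn.Module):
    """One filter shared across all mesh vertices (locally invariant)."""
    def __init__(self, in_ch, out_ch, adj: torch.Tensor):
        super().__init__()
        # adj: dense, row-normalized (V, V) adjacency of the template mesh
        self.register_buffer("adj", adj)
        self.w_self = nn.Linear(in_ch, out_ch)
        self.w_neigh = nn.Linear(in_ch, out_ch)

    def forward(self, x):  # x: (B, V, in_ch)
        neigh = torch.einsum("vw,bwc->bvc", self.adj, x)  # aggregate neighbors
        return torch.relu(self.w_self(x) + self.w_neigh(neigh))

class Speech2Face(nn.Module):
    """Temporal-convolution encoder + mesh-convolution decoder (sketch)."""
    def __init__(self, adj, n_verts, feat_dim=768, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(  # temporal convolutions over speech
            nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU())
        self.to_verts = nn.Linear(hidden, n_verts * 8)
        self.mesh = MeshConv(8, 8, adj)
        self.out = nn.Linear(8, 3)  # per-vertex 3D displacement

    def forward(self, feats):  # feats: (B, T, feat_dim) wav2vec 2.0 features
        h = self.encoder(feats.transpose(1, 2)).transpose(1, 2)  # (B, T, hidden)
        B, T, _ = h.shape
        v = self.to_verts(h).reshape(B * T, -1, 8)  # per-frame vertex features
        return self.out(self.mesh(v)).reshape(B, T, -1, 3)

Sharing the two small linear maps across all vertices keeps the parameter count independent of mesh resolution, which is the lightweight-parameter property the comparative experiments evaluate.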
2. The research on audiovisual-driven 3D facial animation generation focuses mainly on pose control and natural expression. Building on the speech-driven method, an additional expression-pose network is constructed, and experiments are conducted on a public 2D audio-visual dataset. First, the speech-driven network is frozen; its output lip-synchronized 3D facial animation serves as the basis, and speech features are also taken from its encoder output. Next, the earlier video-frame preprocessing algorithm is improved, and the expression-pose network applies a backbone network to extract visual features from the preprocessed frames, followed by temporal convolution to obtain temporal visual features. Finally, the temporal visual features alone regress the visually related expression parameters and head-pose parameters, while a fusion of the speech and temporal visual features regresses the jaw (chin) pose parameters tied to lip movement, acting as a fine-tuning of mouth motion on top of the lip-synchronized animation; a sketch of this two-branch regression follows below. In addition, several consistency losses are introduced to improve how well the network fits expression and pose. Ablation experiments verify the effectiveness of each component, and objective and subjective evaluations show that the proposed audiovisual jointly driven 3D facial animation generation method achieves the best results among the compared models.
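A sketch of the two-branch regression described above. The ResNet-18 backbone and the FLAME-style parameter dimensions (a 50-D expression vector, 3-D head pose, 3-D jaw pose) are illustrative assumptions; the abstract does not specify the backbone or the face parameterization.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class ExpressionPoseNet(nn.Module):
    def __init__(self, speech_dim=128, hidden=256):
        super().__init__()
        backbone = resnet18(weights=None)   # assumed backbone network
        backbone.fc = nn.Identity()         # 512-D per-frame visual feature
        self.backbone = backbone
        self.temporal = nn.Conv1d(512, hidden, kernel_size=5, padding=2)
        self.exp_head = nn.Linear(hidden, 50)              # expression params
        self.head_pose = nn.Linear(hidden, 3)              # head pose
        self.jaw_head = nn.Linear(hidden + speech_dim, 3)  # jaw (chin) pose

    def forward(self, frames, speech_feats):
        # frames: (B, T, 3, H, W) preprocessed video; speech_feats: (B, T, D)
        B, T = frames.shape[:2]
        v = self.backbone(frames.flatten(0, 1)).reshape(B, T, -1)
        v = self.temporal(v.transpose(1, 2)).transpose(1, 2)  # temporal visual
        fused = torch.cat([v, speech_feats], dim=-1)  # speech-visual fusion
        # visual-only branches for expression/head pose; fused branch for jaw
        return self.exp_head(v), self.head_pose(v), self.jaw_head(fused)

The split mirrors the description above: head pose and expression depend only on what is seen, while the jaw pose, which must stay consistent with the lip-synchronized animation, also consumes the speech features.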
Keywords/Search Tags: 3D facial animation, speech-driven, mesh convolution, audio-visual-driven, feature fusion