With the rapid development of science and technology, research on face animation synthesis has attracted growing attention. Speech visualization has become a hot topic in speech research and is widely applied across industries: it offers important guidance to language learners and can effectively promote the communication and transmission of information. Building on speech visualization, this study develops a voice-driven three-dimensional face animation simulation system for speech movements. The system focuses on two key problems: the prediction of articulatory movement parameters and the synthesis of face images. Its distinguishing feature is the ability to display the real-time articulatory movements of Mandarin Chinese (Putonghua) pronunciation visually and realistically. The research covers the following aspects:

(1) Construction of a speech-motion database. Effective Mandarin pronunciation texts were designed, and a 3D motion capture system was used to record the trajectories of 52 facial feature points from standard-pronunciation speakers. The data were cleaned and analyzed, and the parameters preprocessed, to form the speech database required for the experiments.

(2) Prediction of articulatory movement parameters. Because the database of Chinese speech-motion parameters recorded in the laboratory is limited, a Hidden Markov Model (HMM) was introduced to build a context-dependent coarticulation motion prediction model: given an arbitrary speech segment as input, the model predicts the corresponding articulatory movement parameters and produces the facial feature-point images for that segment. Experimental results show that the RMSE (1.3 mm) of the triphone coarticulation model is significantly lower than that of the monophone model. After defining the attributes and question sets of the context-dependent triphones, the models were clustered and their parameters re-estimated; the synthesized optimal articulation trajectory effectively approximates the real trajectory. The facial feature-point images are then extracted frame by frame along the articulation trajectory, which serves as the control points driving the articulatory movements of the face.

(3) Face animation synthesis. A neural network for facial animation synthesis is constructed to learn the motion features corresponding to speech and to synthesize personalized, realistic facial animation videos. First, a conditional generative adversarial network was designed that takes the facial feature-point images as input. Constrained by real face images and their feature points, the generator and discriminator converge to produce realistic face images for the corresponding frames, which are then assembled frame by frame into continuous animated video. A memory-augmented network is embedded for personalized fine-tuning of the faces, and the output is finally combined with the corresponding audio to achieve indirect voice-driven face animation synthesis. Experimental results show that the proposed speech-motion synthesis system can effectively simulate speakers' speech movements, including posture and lip movements, helping people learn and understand the pronunciation and meaning of Mandarin.
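The triphone-to-trajectory step in aspect (2) can be sketched roughly as follows. All triphone names, state means, and durations below are invented for illustration, and the moving-average smoothing merely stands in for the HMM's actual maximum-likelihood parameter generation:

```python
import numpy as np

# Hypothetical 3-state left-to-right HMMs: each state stores the mean
# value of one articulatory parameter (an illustrative 1-D lip-opening
# value) and a frame duration. All numbers here are invented.
TRIPHONE_MODELS = {
    "sil-b+a": {"state_means": [0.0, 0.2, 0.5], "state_durations": [2, 3, 2]},
    "b-a+sil": {"state_means": [0.8, 0.6, 0.1], "state_durations": [3, 4, 2]},
}

def synthesize_trajectory(triphone_sequence, models):
    """Concatenate per-state means over their durations, then smooth with
    a moving average to approximate a coarticulated trajectory."""
    raw = []
    for tri in triphone_sequence:
        m = models[tri]
        for mean, dur in zip(m["state_means"], m["state_durations"]):
            raw.extend([mean] * dur)
    raw = np.array(raw)
    # 3-frame moving average as a crude stand-in for ML parameter generation
    kernel = np.ones(3) / 3.0
    return np.convolve(raw, kernel, mode="same")

traj = synthesize_trajectory(["sil-b+a", "b-a+sil"], TRIPHONE_MODELS)
print(len(traj))  # one smoothed parameter value per frame
```

In the thesis's pipeline, each frame of such a trajectory would drive the feature-point image extracted for that frame.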
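The loss structure of the conditional GAN in aspect (3) can be illustrated with a minimal NumPy sketch. The L1 reconstruction term and its weight follow the common pix2pix-style formulation and are assumptions here, not the thesis's actual settings; the toy 8x8 arrays stand in for feature-point-conditioned frames:

```python
import numpy as np

rng = np.random.default_rng(0)

def bce(pred, target):
    """Binary cross-entropy on sigmoid probabilities."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def generator_loss(d_on_fake, fake_img, real_img, l1_weight=100.0):
    """Adversarial term (fool the discriminator) plus an L1 term tying the
    generated frame to the real face image (pix2pix-style assumption)."""
    adv = bce(d_on_fake, np.ones_like(d_on_fake))
    l1 = np.mean(np.abs(fake_img - real_img))
    return adv + l1_weight * l1

def discriminator_loss(d_on_real, d_on_fake):
    """Real frames scored as 1, generated frames as 0."""
    return bce(d_on_real, np.ones_like(d_on_real)) + \
           bce(d_on_fake, np.zeros_like(d_on_fake))

# toy "images" standing in for a generated frame and its ground truth
real = rng.random((8, 8))
fake = rng.random((8, 8))
print(generator_loss(np.array([0.3]), fake, real))
```

At convergence the generator's frames both fool the discriminator and stay close to the real face of the corresponding frame, which is what lets the frames be stacked into a continuous video.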