
Design And Implementation Of Virtual Human Interaction System For Voice And Lip Synchronization Based On Speech Driving

Posted on: 2024-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: T T Liu
Full Text: PDF
GTID: 2568306944470424
Subject: Communication Engineering (including broadband network, mobile communication, etc.) (Professional Degree)
Abstract/Summary:
With the development of artificial intelligence and the metaverse, users have placed new requirements and expectations on human-computer interaction. Interaction based on virtual humans offers a new approach to human-computer interaction at the present stage and has broad application prospects. Virtual face animation with synchronized voice and lip motion plays a crucial role in interaction with a virtual human: it strongly affects the user's interactive experience and limits the extensibility of virtual human interaction. It is therefore necessary to design and implement a voice-and-lip-synchronized virtual human interaction system with high practicability, strong scalability, and good user experience.

The core problem of virtual human voice-lip synchronization is establishing a mapping from speech to facial visual features. Current mapping methods suffer from incomplete feature extraction (leading to inaccurate prediction of facial animation), weak real-time performance, and output parameters that drive the face poorly, so they cannot be applied effectively in a virtual human interaction system. This thesis establishes an accurate and efficient mapping between speech and facial visual features and, on that basis, designs and implements a voice-and-lip-synchronized virtual human interaction system. The main research contents and innovations are as follows:

1. A speech-to-facial-visual-feature mapping method based on a CNN-Transformer network is proposed to accomplish voice and lip synchronization for a virtual face accurately and efficiently. The network connects a convolutional neural network and a Transformer network in parallel, coupling the local features extracted by the CNN with the global features extracted by the Transformer to obtain comprehensive feature information. Its predictive performance exceeds that of a plain convolutional neural network: under the objective evaluation criteria Mean Square Error (MSE), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE), performance improves by 27.72%, 6.19%, and 14.93%, respectively. To address the low efficiency of network training, the Attention Free Transformer (AFT) mechanism is applied to improve the encoder-layer structure of the Transformer network. The improved network further reduces the time cost of model training without affecting prediction accuracy, cutting training time by 4.8% compared with the original network.

2. Based on the above speech-to-facial-visual-feature mapping method, a speech-driven, voice-and-lip-synchronized virtual human interaction system is designed and implemented using the Unity engine. The system consists of four functional modules: a voice response module, a command control module, an audio-driven module, and a text-driven module. The key technologies involved include speech recognition, speech synthesis, UDP communication, and voice-lip-synchronized driving of the virtual human. The system is first specified in terms of functional and non-functional requirements; reasonable and efficient development schemes are then designed for each module, realizing target requirements such as voice-and-lip-synchronized virtual human interaction, extensible command control, and voice recognition response. Testing demonstrates the completeness and practicability of the designed system.
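The parallel CNN-Transformer coupling described above can be sketched as follows. This is a minimal illustration, assuming mel-spectrogram input of shape (batch, frames, n_mels) and a per-frame blendshape-style output; all layer sizes and the concatenation-based fusion are hypothetical choices for illustration, not the thesis's exact configuration:

```python
import torch
import torch.nn as nn

class CNNTransformerMapper(nn.Module):
    """Parallel CNN + Transformer branches whose local and global
    features are coupled (here, concatenated) before regressing
    per-frame facial visual parameters. Sizes are illustrative."""

    def __init__(self, n_mels=80, d_model=128, n_params=52):
        super().__init__()
        # CNN branch: local features along the time axis.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Transformer branch: global (long-range) features.
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=256,
            batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Fuse both feature streams, regress per-frame parameters.
        self.head = nn.Linear(2 * d_model, n_params)

    def forward(self, x):  # x: (batch, frames, n_mels)
        local = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # (B, T, d)
        global_ = self.transformer(self.proj(x))             # (B, T, d)
        fused = torch.cat([local, global_], dim=-1)          # (B, T, 2d)
        return self.head(fused)                              # (B, T, n_params)

out = CNNTransformerMapper()(torch.randn(2, 100, 80))
```

Running both branches on the same input and fusing their outputs is what lets the model combine the CNN's local receptive fields with the Transformer's global context in a single prediction.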
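The reported improvements refer to standard regression criteria computed between predicted and ground-truth facial parameters. A minimal sketch of these three criteria (the parameter values below are illustrative, not from the thesis):

```python
import numpy as np

def regression_metrics(pred, target):
    """Compute MSE, MAE, and RMSE between predicted and ground-truth
    face visual parameters (e.g., per-frame blendshape weights)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    err = pred - target
    mse = float(np.mean(err ** 2))        # Mean Square Error
    mae = float(np.mean(np.abs(err)))     # Mean Absolute Error
    rmse = float(np.sqrt(mse))            # Root Mean Square Error
    return mse, mae, rmse

mse, mae, rmse = regression_metrics([0.2, 0.5, 0.8], [0.0, 0.5, 1.0])
```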
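The Attention Free Transformer replaces dot-product attention with an elementwise weighting built from keys and learned pairwise position biases. A numpy sketch of the AFT-full operation, following Zhai et al.'s published formulation rather than the thesis's exact encoder configuration:

```python
import numpy as np

def aft_full(Q, K, V, w):
    """AFT-full: Y[t] = sigmoid(Q[t]) * (sum_s exp(K[s] + w[t,s]) * V[s])
                                       / (sum_s exp(K[s] + w[t,s]))
    Q, K, V: (T, d) query/key/value projections; w: (T, T) learned
    pairwise position biases. This direct version materializes a
    (T, T, d) tensor for clarity; AFT's efficiency in practice comes
    from avoiding multi-head dot-product attention and from
    memory-efficient reformulations of this sum."""
    weights = np.exp(w[:, :, None] + K[None, :, :])   # (T, T, d)
    num = (weights * V[None, :, :]).sum(axis=1)       # (T, d)
    den = weights.sum(axis=1)                         # (T, d)
    return (1.0 / (1.0 + np.exp(-Q))) * num / den     # (T, d)

T, d = 6, 8
rng = np.random.default_rng(0)
Y = aft_full(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
             rng.normal(size=(T, d)), rng.normal(size=(T, T)))
```

With zero keys and zero biases the weighting becomes uniform, so each output row reduces to sigmoid(Q[t]) times the mean of V, which is a quick sanity check on the formula.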
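UDP communication between the speech-processing backend and the Unity client could, for instance, stream per-frame lip-sync parameters as small datagrams. The JSON message format and parameter name below are assumptions for illustration, not the thesis's actual protocol:

```python
import json
import socket

def send_lipsync_frame(sock, addr, params):
    """Send one frame of facial parameters as a UDP datagram.
    `params` is a dict such as {"jawOpen": 0.42}; the Unity side
    would parse the JSON and apply the values to its blendshapes."""
    sock.sendto(json.dumps(params).encode("utf-8"), addr)

# Loopback round trip demonstrating the datagram exchange.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))               # OS-assigned free port
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_lipsync_frame(sender, receiver.getsockname(), {"jawOpen": 0.42})
frame, _ = receiver.recvfrom(4096)
received = json.loads(frame.decode("utf-8"))
sender.close()
receiver.close()
```

UDP suits this role because a dropped or late frame of lip parameters is better skipped than retransmitted; the next frame supersedes it anyway.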
Keywords/Search Tags: speech-driven virtual human, deep learning, interactive system