Multi-modal human-computer interaction is the development trend of natural human-computer interaction. It takes multi-modal information as input and output, and makes full use of the characteristics of each modality to improve the efficiency and authenticity of human-computer interaction. Recently, the rapid development of machine learning has prompted more and more researchers to apply increasingly mature techniques, such as computer vision, speech recognition, speech synthesis, and other natural language processing methods, to multi-modal human-computer interaction. This thesis implements a Multi-Modal Human-Computer Interaction system (MMHCI) with audio and video as input and output, using modal information such as speech, text, facial expressions, and video. Building on existing research, MMHCI cascades four algorithm modules (speech recognition, text dialogue, speech synthesis, and facial expression animation generation) to combine speech, text, expression, and video information. The performance test report of MMHCI is summarized and analyzed to identify the main factors affecting system performance, and improvements are then proposed for the recognition-rate and real-time-rate metrics of the speech recognition module.

After analyzing mainstream end-to-end speech recognition algorithms, this thesis adopts the joint CTC/Attention model and proposes two methods to improve it. First, it proposes a hybrid CTC/Attention encoder-decoder structure with a ProbSparse self-attention mechanism (Prob-Sparse CAED). Because the computational complexity of the attention mechanism grows quadratically with the feature sequence length, ProbSparse attention is introduced to replace the attention mechanism of the Conformer. Multiple sets of experiments show that the recognition rate and real-time rate of the Prob-Sparse CAED model are improved by 7%~9% and 1%~2%, respectively. Second, it proposes a CTC algorithm with maximum entropy regularization. Owing to certain defects in CTC training, the model is prone to overfitting, which results in overconfident, peaky distribution predictions. This thesis therefore introduces maximum entropy regularization into CTC on top of the Prob-Sparse CAED model, and experiments show that this method improves the recognition rate by nearly 1%. The overall performance of the improved MMHCI and its speech recognition module is significantly better: compared with the old MMHCI system, the word error rate of the speech recognition module drops from 1.57% to 0.27%, and the MOS score of MMHCI rises from 2.35/5 to 3.43/5, an improvement of one performance level, from close to "poor" to between "good" and "better".
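The abstract does not detail how ProbSparse attention reduces the quadratic cost, so the following is a minimal NumPy sketch of the idea (as introduced in Informer): rank queries by a sparsity score (max minus mean of their attention logits), let only the top-u "active" queries attend over all keys, and let the remaining "lazy" queries fall back to the mean of the values. The function name and the computation of the sparsity score from the full score matrix are simplifications for illustration; the actual algorithm estimates the score from a sampled subset of keys to reach roughly O(L log L) complexity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prob_sparse_attention(Q, K, V, factor=5):
    """Simplified ProbSparse self-attention.

    Only the top-u queries (ranked by a max-minus-mean sparsity score)
    attend over all keys; the remaining queries output the mean of V.
    Q, K, V: (L, d) arrays for a single head.
    """
    L, d = Q.shape
    # Number of active queries, u ~ factor * log(L), clipped to [1, L].
    u = min(L, max(1, int(factor * np.ceil(np.log(L)))))

    # Full score matrix for clarity; the real algorithm samples keys here.
    scores = Q @ K.T / np.sqrt(d)                  # (L, L)

    # Sparsity measurement: a "peaky" (informative) query has a large
    # gap between its maximum and mean score.
    M = scores.max(axis=1) - scores.mean(axis=1)
    top = np.argsort(-M)[:u]                       # indices of active queries

    out = np.tile(V.mean(axis=0), (L, 1))          # lazy queries -> mean of V
    out[top] = softmax(scores[top], axis=1) @ V    # active queries -> attention
    return out
```

Because only u of the L query rows go through the softmax-weighted sum, the per-layer cost in the full model drops accordingly, which is consistent with the real-time-rate gains reported above.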
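The maximum entropy regularization of CTC can be sketched as follows: the standard CTC negative log-likelihood (computed with the forward algorithm over the blank-extended label sequence) is combined with an entropy bonus that discourages the overconfident, peaky per-frame distributions mentioned above. This is a hypothetical, simplified form: the abstract does not specify the exact regularizer, and for brevity this sketch penalizes low per-frame entropy rather than the entropy over feasible CTC paths; the function names are illustrative, and `target` is assumed non-empty.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC forward algorithm: -log p(target | log_probs).
    log_probs: (T, C) per-frame log-softmax outputs.
    target: non-empty list of label ids (no blanks).
    """
    T, C = log_probs.shape
    ext = [blank]
    for t in target:
        ext += [t, blank]                      # blank-extended label sequence
    S = len(ext)
    alpha = np.full((T, S), -np.inf)           # log-domain forward variables
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cand = [alpha[t - 1, s]]           # stay on the same symbol
            if s > 0:
                cand.append(alpha[t - 1, s - 1])   # advance by one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cand.append(alpha[t - 1, s - 2])   # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cand) + log_probs[t, ext[s]]
    # Valid endings: last label or trailing blank.
    return -np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])

def entropy_regularized_ctc(log_probs, target, beta=0.1, blank=0):
    """CTC loss minus beta times the mean per-frame entropy.
    Subtracting the entropy term rewards flatter (less peaky) frame
    distributions, counteracting CTC's tendency to overconfidence.
    """
    probs = np.exp(log_probs)
    frame_entropy = -(probs * log_probs).sum(axis=1).mean()
    return ctc_neg_log_likelihood(log_probs, target, blank) - beta * frame_entropy
```

In training, `beta` trades off likelihood against smoothness of the frame posteriors; the roughly 1% recognition-rate gain reported above suggests a small regularization weight is sufficient.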