Multi-modal human-computer interaction is the development trend of natural human-computer interaction. It takes multi-modal information as input and output, and makes full use of the characteristics of each modality to improve the efficiency and authenticity of human-computer interaction. Recently, the rapid development of machine learning has prompted more and more researchers to apply increasingly mature techniques, such as computer vision, speech recognition, speech synthesis, and other natural language processing methods, to multi-modal human-computer interaction. This thesis implements a Multi-Modal Human-Computer Interaction system (MMHCI) with audio and video as input and output, using modal information such as speech, text, facial expressions, and video. Building on existing research, MMHCI cascades four algorithm modules (speech recognition, text dialogue, speech synthesis, and facial expression animation generation) to combine speech, text, expression, and video information. The performance test report of MMHCI is summarized and analyzed to identify the main factors affecting system performance, and improvements are then proposed for the recognition-rate and real-time-rate metrics of the speech recognition module.

After analyzing mainstream end-to-end speech recognition algorithms, this thesis adopts the joint CTC/Attention model and proposes two methods to improve it. First, it proposes a hybrid CTC/Attention encoder-decoder structure with a ProbSparse self-attention mechanism (Prob-Sparse CAED). Because the computational complexity of the attention mechanism grows quadratically with the feature sequence length, ProbSparse attention is introduced to replace the attention mechanism of the Conformer. Multiple sets of experiments show that the recognition rate and real-time rate of the Prob-Sparse CAED model are improved by 7%~9% and 1%~2%, respectively. Second, it proposes a CTC algorithm with maximum entropy regularization. Owing to certain defects in CTC training, the model is prone to overfitting, which results in overconfident, peaky distribution predictions. This thesis therefore introduces maximum entropy regularization into CTC on top of the Prob-Sparse CAED model, and experiments show that this method improves the recognition rate by nearly 1%. The overall performance of the improved MMHCI and its speech recognition module is significantly better: compared with the old MMHCI system, the word error rate of the speech recognition module drops from 1.57% to 0.27%, and the MOS score of MMHCI rises from 2.35/5 to 3.43/5, an improvement of one performance level, from close to "poor" to between "good" and "better".
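The abstract does not detail how ProbSparse attention reduces the quadratic cost, so the following is a minimal NumPy sketch of the idea (as introduced in Informer): rank queries by a sparsity score (max minus mean of their attention logits), let only the top-u "active" queries attend over all keys, and let the remaining "lazy" queries fall back to the mean of the values. The function name and the computation of the sparsity score from the full score matrix are simplifications for illustration; the actual algorithm estimates the score from a sampled subset of keys to reach roughly O(L log L) complexity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prob_sparse_attention(Q, K, V, factor=5):
    """Simplified ProbSparse self-attention.

    Only the top-u queries (ranked by a max-minus-mean sparsity score)
    attend over all keys; the remaining queries output the mean of V.
    Q, K, V: (L, d) arrays for a single head.
    """
    L, d = Q.shape
    # Number of active queries, u ~ factor * log(L), clipped to [1, L].
    u = min(L, max(1, int(factor * np.ceil(np.log(L)))))

    # Full score matrix for clarity; the real algorithm samples keys here.
    scores = Q @ K.T / np.sqrt(d)                  # (L, L)

    # Sparsity measurement: a "peaky" (informative) query has a large
    # gap between its maximum and mean score.
    M = scores.max(axis=1) - scores.mean(axis=1)
    top = np.argsort(-M)[:u]                       # indices of active queries

    out = np.tile(V.mean(axis=0), (L, 1))          # lazy queries -> mean of V
    out[top] = softmax(scores[top], axis=1) @ V    # active queries -> attention
    return out
```

Because only u of the L query rows go through the softmax-weighted sum, the per-layer cost in the full model drops accordingly, which is consistent with the real-time-rate gains reported above.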
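The maximum entropy regularization of CTC can be sketched as follows: the standard CTC negative log-likelihood (computed with the forward algorithm over the blank-extended label sequence) is combined with an entropy bonus that discourages the overconfident, peaky per-frame distributions mentioned above. This is a hypothetical, simplified form: the abstract does not specify the exact regularizer, and for brevity this sketch penalizes low per-frame entropy rather than the entropy over feasible CTC paths; the function names are illustrative, and `target` is assumed non-empty.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC forward algorithm: -log p(target | log_probs).
    log_probs: (T, C) per-frame log-softmax outputs.
    target: non-empty list of label ids (no blanks).
    """
    T, C = log_probs.shape
    ext = [blank]
    for t in target:
        ext += [t, blank]                      # blank-extended label sequence
    S = len(ext)
    alpha = np.full((T, S), -np.inf)           # log-domain forward variables
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cand = [alpha[t - 1, s]]           # stay on the same symbol
            if s > 0:
                cand.append(alpha[t - 1, s - 1])   # advance by one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cand.append(alpha[t - 1, s - 2])   # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cand) + log_probs[t, ext[s]]
    # Valid endings: last label or trailing blank.
    return -np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])

def entropy_regularized_ctc(log_probs, target, beta=0.1, blank=0):
    """CTC loss minus beta times the mean per-frame entropy.
    Subtracting the entropy term rewards flatter (less peaky) frame
    distributions, counteracting CTC's tendency to overconfidence.
    """
    probs = np.exp(log_probs)
    frame_entropy = -(probs * log_probs).sum(axis=1).mean()
    return ctc_neg_log_likelihood(log_probs, target, blank) - beta * frame_entropy
```

In training, `beta` trades off likelihood against smoothness of the frame posteriors; the roughly 1% recognition-rate gain reported above suggests a small regularization weight is sufficient.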