An English speech is a multimodal communication scene that requires coordination of the speaker's verbal and nonverbal behavior. In English speech evaluation, manual scoring is costly and prone to subjectivity. Against the background of the deep integration of artificial intelligence and educational technology, developing a multimodal intelligent evaluation system for English speech is of great importance for speech training and assessment. Based on the verbal expression and body actions of English speech, this paper builds an intelligent English speech evaluation system drawing on three modalities: audio, text, and video. The main work is as follows.

According to the needs of multimodal speech evaluation, this paper constructs a multimodal intelligent evaluation framework for verbal expression and actions in English speech, develops 3D data acquisition equipment to collect English speech data, and has speech scoring experts score the collected data. Features are comprehensively extracted from the audio, text, and video modalities, and a PCA algorithm based on truncated singular value decomposition is adopted to reduce the feature dimension and optimize feature selection. An ensemble learning strategy is then proposed to construct evaluation models for both the overall score and the single-item scores of speech performance. The results show that the Pearson correlation coefficients between the machine scores and the corresponding expert scores are 0.816, 0.669, and 0.758 for the nonverbal, verbal expression, and language use items, respectively. For the overall item, the Pearson correlation coefficient between the machine-predicted and expert scores is 0.859; with an allowable error of 0.5 points according to the scoring scale, the mean absolute error, error variance, and R-squared value of the overall machine scoring are 0.213, 0.286, and 0.718, respectively. This shows that the machine scores produced by the proposed ensemble learning model are highly correlated with the manual scores and that the model achieves good predictive performance.

Because the overall score of a speech is the result of multimodal communication, the modalities interact with and complement one another. For the overall item, this paper therefore proposes a multimodal deep learning network based on spatiotemporal fusion, composed of four parts (an inter-frame fusion network, a text model, a speech model, and a fusion network) to further extract and fuse multimodal features. With a small training set, the Pearson correlation coefficient between the network's predicted scores and the expert scores is 0.789, the mean absolute error is 0.247, the error variance is 0.300, and the R-squared value of the model is 0.603. The scores predicted by the model are highly correlated with the manual scores, and their error variance is lower than that between the human raters. In the future, the scoring model and evaluation system can be further improved to provide practical and efficient scoring tools for English speech training and real-world scoring.
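
The feature reduction, ensemble scoring, and agreement metrics mentioned above can be illustrated with a minimal sketch. The Python example below is hypothetical and not the paper's code: the feature matrix, the choice of gradient boosting as the ensemble regressor, and the number of retained components are all assumptions; it only shows the general shape of truncated-SVD-based PCA followed by an ensemble model evaluated with Pearson correlation, mean absolute error, error variance, and R-squared.

```python
# Hypothetical sketch (not the authors' implementation): truncated-SVD-based
# PCA for dimensionality reduction, an ensemble regressor as a stand-in for
# the paper's ensemble strategy, and the agreement metrics reported above.
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 300))      # placeholder fused audio/text/video features
y = rng.uniform(1.0, 5.0, size=120)  # placeholder expert overall scores

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Reduce the high-dimensional multimodal features with truncated SVD.
svd = TruncatedSVD(n_components=30, random_state=0)
X_train_low = svd.fit_transform(X_train)
X_test_low = svd.transform(X_test)

# One possible ensemble regressor; the paper's exact ensemble may differ.
model = GradientBoostingRegressor(random_state=0)
model.fit(X_train_low, y_train)
pred = model.predict(X_test_low)

# Agreement between machine scores and expert scores.
r, _ = pearsonr(pred, y_test)
mae = mean_absolute_error(y_test, pred)
err_var = np.var(y_test - pred)
r2 = r2_score(y_test, pred)
print(f"Pearson r={r:.3f}  MAE={mae:.3f}  error variance={err_var:.3f}  R^2={r2:.3f}")
```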