
Pronunciation Evaluation Algorithm Based On Deep-Learning

Posted on: 2023-10-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Zhang
Full Text: PDF
GTID: 1528307301458544
Subject: Electronics and information
Abstract/Summary:
In the era of globalization, communication between people in different regions has become increasingly close, and more and more people need to learn a second language. In contrast to this rapidly growing demand, however, most learners face a shortage of educational resources, so how to better provide educational resources for language learning is a valuable research question. With the rapid development of speech recognition technology and Internet communication, computer-assisted oral language evaluation algorithms have begun to be applied in teaching. Compared with traditional classroom teaching, oral evaluation algorithms offer more objective evaluation feedback, more abundant learning resources, and friendlier learning methods, and they have therefore become a research hotspot in recent years. The main work of this paper includes the following three points:

(1) This paper proposes an end-to-end Transformer model based on text priors, which solves the problem that speech-recognition-based spoken language evaluation models cannot directly optimize evaluation performance. Current mainstream deep-learning oral evaluation algorithms detect misreading via speech recognition; such algorithms do not consider the target text of the evaluation task during training and only improve evaluation performance indirectly, by improving recognition accuracy. However, the target text strongly constrains the decoding space of the actual pronunciation and can therefore make spoken language evaluation more accurate. The proposed model feeds the target text, as strong prior information, into the Transformer decoder as a conditional input, fusing speech recognition with the alignment to the target text so that error states are predicted end-to-end. This directly optimizes evaluation performance and greatly improves on the indirect, recognition-based approach: on L2-Arctic, a commonly used data set for oral language evaluation, the F1 score of the mainstream industry speech recognition method is 0.475, which the proposed method raises to 0.577. In addition, because the target text serves as an input condition, inference becomes feed-forward rather than autoregressive, improving inference speed on the test set by nearly 9 times.

(2) This paper proposes a method for modeling accented (L2) speech features based on self-supervised acoustic units, together with a training procedure that simulates misread data based on semantic distance, alleviating the overfitting that occurs when L2 speech features are modeled directly with supervised learning. Deep spoken language evaluation models typically require large amounts of training data to model accented speech features accurately. However, because the evaluation task requires experts to annotate the actual pronunciation at the phoneme level, labeled training data are scarce, and supervised models trained directly on accented data often overfit. Current mainstream methods address this with data augmentation or by replacing the target text, but they do not further analyze the cause of misreading and therefore cannot generate realistic accent data to support supervised modeling. To solve this problem, this paper converts the original accented audio into semantic vectors with a self-supervised model and discretizes the vectors into acoustic units via k-means clustering, so that accented audio is modeled without manual annotation. Using these acoustic units as a medium, the method finds a pronunciation similar to a given original speech feature according to semantic distance and substitutes it, simulating more realistic misreadings. Pre-training on this simulated misreading data further improves the F1 score of the aforementioned end-to-end evaluation model to 0.607; with only 20% of the labeled data, the method still reaches an F1 score of 0.509.

(3) This paper proposes a pronunciation correction algorithm based on acoustic units and generative models. The algorithm retains the speaker's correct pronunciations while correcting wrong ones, solving the problem that existing spoken language evaluation models cannot provide personalized feedback in the speech modality. Most existing evaluation models are discriminative and can generally give feedback only as text, which is relatively simple and often hard to understand. Correction based on the user's input speech provides more intuitive feedback in the speech modality while preserving the speaker's style and speed, making the feedback more personalized and helping users perceive the difference in pronunciation before and after correction. To achieve this, this paper uses acoustic units as the medium to simulate paired misreading data from standard pronunciations, and proposes a single-stage pronunciation correction model that corrects the wrong acoustic units and further converts them to standard pronunciation. On this basis, the paper proposes a two-stage pronunciation correction model based on Normalizing Flow, which achieves reversible semantic extraction and speech synthesis and yields better generation quality than the single-stage model. Compared with speech synthesis models that generate directly from text, the proposed model also produces standard pronunciation but better preserves the user's speaking style, making the feedback more personalized and helping the user perceive the difference in pronunciation before and after correction, thereby guiding users to learn oral language from the perspectives of both perception and pronunciation.
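The text-prior idea in point (1) can be illustrated with a minimal sketch. This is an assumption-laden toy, not the dissertation's architecture: the class name, dimensions, and two-state output (correct/mispronounced) are all hypothetical. The key property it demonstrates is that the target phoneme sequence is the decoder's query, audio is attended via cross-attention, and no causal mask is used, so inference is a single feed-forward pass rather than autoregressive decoding.

```python
# Hypothetical sketch of a text-prior Transformer evaluator (PyTorch).
# All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TextPriorEvaluator(nn.Module):
    def __init__(self, n_phonemes=50, d_model=128, n_states=2):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, d_model)
        self.audio_proj = nn.Linear(80, d_model)       # 80-dim fbank assumed
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_states)       # correct / mispronounced

    def forward(self, audio_feats, target_phones):
        # Target text is the query; audio is the cross-attention memory,
        # fusing recognition and alignment to predict per-phoneme error states.
        memory = self.audio_proj(audio_feats)
        query = self.phone_emb(target_phones)
        hidden = self.decoder(query, memory)           # no causal mask: feed-forward
        return self.head(hidden)                       # (batch, n_phones, n_states)

model = TextPriorEvaluator()
logits = model(torch.randn(2, 120, 80), torch.randint(0, 50, (2, 15)))
print(logits.shape)  # torch.Size([2, 15, 2])
```

Because the output length equals the known target-text length, all positions are predicted in parallel, which is the source of the reported inference speedup over autoregressive recognition.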
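The acoustic-unit pipeline in point (2) can be sketched as follows. The random vectors stand in for self-supervised frame features (the abstract does not name the underlying model); the cluster count, dimensions, and the single-position substitution are all illustrative assumptions. The sketch shows the two steps the abstract describes: discretizing frames into acoustic units with k-means, then simulating a misreading by swapping a unit for its nearest neighbour in centroid (semantic-distance) space.

```python
# Hypothetical sketch: k-means acoustic units + semantic-distance substitution.
# Toy data; real inputs would be self-supervised speech features.
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 16))        # stand-in for frame-level features

# --- Step 1: discretize frames into K acoustic units (plain-numpy k-means) ---
K = 8
centroids = feats[rng.choice(len(feats), K, replace=False)]
for _ in range(20):
    labels = np.argmin(((feats[:, None] - centroids) ** 2).sum(-1), axis=1)
    centroids = np.stack([feats[labels == k].mean(0) if (labels == k).any()
                          else centroids[k] for k in range(K)])

# --- Step 2: semantic-distance substitution: nearest *other* centroid ---
def confusable_unit(k):
    d = ((centroids - centroids[k]) ** 2).sum(-1)
    d[k] = np.inf                          # exclude the unit itself
    return int(np.argmin(d))

units = labels[:10]                        # a short "utterance" of units
simulated = [confusable_unit(u) if i == 3 else int(u)  # misread one position
             for i, u in enumerate(units)]
```

The substituted unit is acoustically close to the original, which is what makes the simulated misreading more realistic than random replacement, and the whole pipeline needs no phoneme-level annotation.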
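The reversibility that point (3) attributes to the Normalizing Flow model can be demonstrated with one affine coupling layer, the standard flow building block. The dimensions and the tiny conditioner network are toy assumptions, not the dissertation's model; the point is only that the same invertible map can serve as "semantic extraction" (forward) and "speech synthesis" (inverse), round-tripping exactly up to float error.

```python
# Hypothetical sketch of an affine coupling layer (normalizing-flow block).
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 8)), rng.normal(size=8)   # tiny conditioner net
W2, b2 = rng.normal(size=(8, 4)), rng.normal(size=4)   # outputs scale + shift

def conditioner(x_a):
    h = np.tanh(x_a @ W1 + b1)
    out = h @ W2 + b2
    return out[:, :2], out[:, 2:]                      # log-scale s, shift t

def forward(x):            # "semantic extraction": x -> z
    x_a, x_b = x[:, :2], x[:, 2:]
    s, t = conditioner(x_a)
    return np.concatenate([x_a, x_b * np.exp(s) + t], axis=1)

def inverse(z):            # "speech synthesis": z -> x
    z_a, z_b = z[:, :2], z[:, 2:]
    s, t = conditioner(z_a)
    return np.concatenate([z_a, (z_b - t) * np.exp(-s)], axis=1)

x = rng.normal(size=(5, 4))
recon = inverse(forward(x))   # exact round trip up to float precision
```

Because the inverse is exact by construction (the scale and shift are recomputed from the untouched half), a correction applied in the latent space can be synthesized back to speech without a separately trained decoder, which is one plausible motivation for the two-stage design described above.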
Keywords/Search Tags: Speech Recognition, Computer-Assisted Language Learning, Computer-Assisted Pronunciation Training, Pronunciation Evaluation, Speech Processing, Mispronunciation Detection