Study On Speech Recognition Technology In Complex Scenes

Posted on: 2024-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: E W Liu
GTID: 2568307079465844
Subject: Electronic information
Abstract/Summary:
Speech recognition technology uses computers to automatically analyze human speech and convert speech signals into the corresponding text or instructions, and it is widely used in daily life. Current commercial speech recognition systems achieve very high accuracy in relatively quiet conditions, and on clean speech their performance has reached or exceeded the human level. However, when the surrounding environment is noisy, the recognition rate drops sharply. Studying speech recognition in complex scenes is therefore of practical significance. This thesis explores how to make a speech recognition system robust to its environment in complex scenarios. It proposes loss functions based on human voice sensitivity that reduce speech distortion and thereby improve recognition accuracy, and it further studies how to combine the speech enhancement front end with the speech recognition back end to improve system performance. The main work of this thesis is as follows:

(1) In a simulated complex environment, the thesis addresses the speech distortion caused by noise reduction in the front-end speech enhancement system. A TCN (temporal convolutional network) is used as the backbone network of the enhancement model, and a series of loss functions based on human voice sensitivity is designed by improving on the MSE (Mean Square Error) loss: a loss with coefficient compression, a loss with an added penalty term, and a loss that combines coefficient compression with the penalty term. Experiments on AISHELL-1 data show that these loss functions effectively reduce speech distortion and that the proposed loss function yields a low word error rate in the speech recognition system. The enhancement model thus avoids losing human voice as far as possible while denoising, which further improves recognition accuracy in complex scenarios.

(2) In real complex scenarios, the research is based mainly on the CHiME-3 challenge. Frequency-domain and time-domain speech enhancement models are applied as front ends to the speech recognition system, but the mismatch between front-end and back-end features degrades recognition performance. To solve this problem, the speech enhancement front end and the end-to-end speech recognition back end are jointly trained, starting from pre-trained models, to obtain better recognition performance. Although joint training improves the speech recognition system, it does so at the expense of the pre-trained speech enhancement system, that is, at the expense of enhancement quality as perceived by human listeners. To address this, multi-target learning is applied to the speech enhancement front end, which further improves recognition performance in complex scenarios. Specifically, the word error rate of the speech recognition system on the CHiME-3 dataset is reduced from 33.04% to 9.98%, which demonstrates the effectiveness of the proposed joint training of the speech enhancement front end and the speech recognition back end.
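For readers unfamiliar with the metric, the word error rate figures above (33.04% and 9.98%) can be put in context with a short sketch. It assumes the standard definition of word error rate, the word-level edit distance between reference and hypothesis divided by the number of reference words; the thesis's own scoring tool is not described in the abstract. The function names and the example sentences are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization; real scoring tools may normalize
    text differently (and Mandarin corpora are often scored per character).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate fell, relative to its starting value."""
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical transcript pair: one of four reference words is dropped.
    print(word_error_rate("turn on the light", "turn on light"))
    # Error rates quoted in the abstract, in percent.
    print(round(100 * relative_reduction(33.04, 9.98), 1))
```

Under this definition, the reported change is 23.06 absolute percentage points, or a relative reduction of roughly 69.8%.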
Keywords/Search Tags: Complex Environment, Robust Automatic Speech Recognition, Speech Enhancement, Joint Training, End-to-end Method