
Research On Multi-Speaker Speech Separation And Speech Recognition In Noisy Environment

Posted on: 2024-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y N Zhang
GTID: 2568307181950869
Subject: Electronic Information (Artificial Intelligence) (Professional Degree)

Abstract/Summary:
Speech recognition is the process of converting a speaker's voice into text that conforms to grammatical rules. Speech recognition with high accuracy and robustness is a key technology for building the auditory systems of intelligent robots and achieving human-machine interaction. In recent years, the rapid development of deep learning has significantly improved the accuracy and robustness of speech recognition. However, in noisy environments with multiple speakers, the speech signal is degraded by noise and interference. Researching key technologies such as speech enhancement, multi-speaker speech separation, and high-accuracy speech recognition in noisy environments is therefore critical to advancing speech recognition toward practical application scenarios. The research contents of this paper are as follows:

(1) Research on a complex-domain FullSubNet speech enhancement model with time-frequency attention for noisy environments. FullSubNet can effectively capture full-band contextual frequency information and the local spectral information of the signal, but it lacks the ability to represent the signal's time-frequency energy distribution. To address this problem, a time-frequency-aware module that captures the distribution information of speech signals is integrated into the complex-domain FullSubNet, yielding a new complex-domain FullSubNet with time-frequency attention (a minimal sketch of such a time-frequency gate is given after the abstract). Comparative experiments show that the proposed model delivers clearly better speech enhancement performance.

(2) Research on a time-domain speech separation model with convolution-augmented external attention. External attention is extended to both the spatial and channel dimensions and combined with a convolution-augmented module and a convolutional position-encoding module to form a new convolution-augmented external attention module (see the external-attention sketch below). This module is then applied to the encoder-decoder structure of TasNet to model speech signals, yielding a convolution-augmented external attention time-domain speech separation model (ExConNet). Comparative experiments show that ExConNet has fewer parameters and achieves better multi-speaker speech separation performance.

(3) Research on CTC/Attention speech recognition with interactive feature fusion. An ACmix-CTC/Attention speech recognition model is proposed to improve recognition performance on clean speech. For speech recognition in noisy environments, an interactive feature fusion module is used to jointly train the CTC/Attention speech recognition model and the speech enhancement model (the joint objective is sketched below), improving the robustness and accuracy of noisy speech recognition. Experiments on clean datasets show that the proposed model achieves lower word error rates, and tests on noisy datasets confirm the effectiveness of CTC/Attention speech recognition with interactive feature fusion in noisy environments.
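For contribution (1), the following is a minimal PyTorch sketch of a time-frequency attention gate of the kind the abstract describes. It is an illustrative assumption rather than the thesis's exact module: the feature map is pooled along time and along frequency, and the input is reweighted with sigmoid gates derived from each profile.

```python
import torch
import torch.nn as nn

class TimeFrequencyAttention(nn.Module):
    """Reweights a (batch, channel, freq, time) feature map with
    frequency-wise and time-wise gates (illustrative sketch only)."""
    def __init__(self, channels: int):
        super().__init__()
        self.freq_att = nn.Conv2d(channels, channels, kernel_size=1)
        self.time_att = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_profile = x.mean(dim=3, keepdim=True)            # average over time  -> (B, C, F, 1)
        t_profile = x.mean(dim=2, keepdim=True)            # average over freq  -> (B, C, 1, T)
        f_gate = torch.sigmoid(self.freq_att(f_profile))   # frequency attention weights
        t_gate = torch.sigmoid(self.time_att(t_profile))   # time attention weights
        return x * f_gate * t_gate                         # broadcast reweighting

# quick shape check
feats = torch.randn(2, 16, 257, 100)                       # (batch, channel, freq, time)
out = TimeFrequencyAttention(16)(feats)
assert out.shape == feats.shape
```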
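For contribution (2), below is a minimal sketch of plain external attention over the time axis, with shared learnable key and value memories; the memory size is an arbitrary assumption, and the spatial/channel extension and convolution-augmented wrapper described above are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalAttention(nn.Module):
    """Plain external attention with shared learnable memories (illustrative sketch)."""
    def __init__(self, dim: int, mem_size: int = 64):
        super().__init__()
        self.mk = nn.Linear(dim, mem_size, bias=False)   # external key memory
        self.mv = nn.Linear(mem_size, dim, bias=False)   # external value memory

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        attn = F.softmax(self.mk(x), dim=1)                    # attend over time steps
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)  # double normalization
        return self.mv(attn)                                   # (batch, time, dim)

seq = torch.randn(2, 200, 128)
print(ExternalAttention(128)(seq).shape)                       # torch.Size([2, 200, 128])
```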
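For contribution (3), a hedged sketch of the kind of joint objective implied by training the enhancement front-end together with the hybrid CTC/Attention recognizer: a weighted sum of the three losses. The function name and weights are assumptions for illustration, not values from the thesis.

```python
import torch

def joint_loss(enh_loss: torch.Tensor, ctc_loss: torch.Tensor, att_loss: torch.Tensor,
               lam_enh: float = 0.5, lam_ctc: float = 0.3) -> torch.Tensor:
    """Weighted sum of enhancement, CTC, and attention-decoder losses
    (hypothetical weighting for joint training)."""
    asr_loss = lam_ctc * ctc_loss + (1.0 - lam_ctc) * att_loss  # hybrid CTC/Attention mix
    return lam_enh * enh_loss + asr_loss

# example with dummy scalar losses
total = joint_loss(torch.tensor(0.8), torch.tensor(1.2), torch.tensor(0.9))
```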
Keywords/Search Tags:Speech enhancement, Speech separation, Speech recognition, Deep learning