
Structured Deep Learning For Adaptive Speech Recognition

Posted on: 2019-09-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Tan
Full Text: PDF
GTID: 1368330590470383
Subject: Computer Science
Abstract/Summary:
Although automatic speech recognition (ASR) has made great progress thanks to the deep neural network based hidden Markov model (DNN-HMM), significant performance degradation is still observed in noisy environments, caused by the mismatch in acoustic conditions between training and test data. Adaptation and adaptive training of DNNs are therefore of great research interest. In previous DNN adaptation work, a large number of parameters had to be estimated during adaptation because no prior knowledge of the DNN structure was exploited, so the limited adaptation data could not be used efficiently. In this thesis, structured deep learning is proposed to perform DNN adaptation and adaptive training more efficiently. It comprises two parts: structured deep learning based feature adaptation and structured deep models for adaptive training. Furthermore, an adaptive very deep convolutional residual network (VDCRN) is proposed, which achieves the best performance in noise-robust speech recognition.

Structured deep learning based feature adaptation focuses on the context-aware training framework for DNN adaptation, including speaker-aware training based recurrent neural network (RNN) adaptation and DNN based online adaptation. First, we propose speaker-aware training based RNN adaptation; this is the first work to apply speaker-aware training to RNNs. In addition to the basic concatenating structure, we investigate two new structures to avoid potential information explosion. We then propose deep learning based context vector extraction. Furthermore, we design a multi-task structure to extract a context vector that encodes multiple factors (such as speaker and mono-phone) and a phone-aware structure to extract a purer speaker representation. Using DNN-based features and i-vectors for speaker-aware training based RNN adaptation, a 6.5% relative improvement is obtained on the AMI meeting transcription task. Finally, we extend the concept of context to genre in the RNN based language model (LM) and investigate genre-aware training based RNNLM adaptation. On the BBC multi-genre show transcription task, the proposed method obtains a significant word error rate (WER) reduction compared to the non-adapted model.

We then further investigate a DNN based online adaptation method. We propose a novel DNN based multi-factor aware joint training framework. This approach is a structured model that integrates several different functional modules into one deep computational model. We extract speaker, phone and environment factor representations using DNNs, which are integrated into the main ASR DNN to improve classification accuracy. All model parameters, including those of the ASR DNN and the factor-extraction DNNs, are jointly optimized under the multi-task learning framework. Our approach requires no explicit separate stages for factor extraction and adaptation. The proposed method is evaluated on two noise-robust tasks, AMI and Aurora4; experiments on both show that the proposed model reduces WER by a relative 10%-18%.

Structured deep models for adaptive training build on cluster adaptive training. In this thesis, we extend it to DNNs: for a given layer, we use multiple matrices to form a weight-matrix basis, and an interpolation vector is estimated to combine the basis into a context-dependent weight matrix. Since only the interpolation vector needs to be estimated during adaptation, far fewer adaptation parameters are required than in previous work, so limited adaptation data can be used efficiently. Furthermore, we prove that context-aware training based DNN adaptation is equivalent to using a bias basis and can therefore be treated as a special case of this framework. The proposed cluster adaptive training based DNN adaptation is evaluated on the English Switchboard task, where a significant 7.6%-10.6% relative improvement is obtained.

Finally, the thesis proposes an adaptive very deep convolutional residual network. The two proposed structured deep learning methods are further extended to the VDCRN, solving the problem of concatenating a 2D input with a vector. In addition, we extend cluster adaptive training to CNNs and investigate the use of different bases. We propose a factorized structure to model multiple factors simultaneously. The proposed methods obtain 5.92% WER on Aurora4, the state-of-the-art performance on this task. Finally, we propose a multi-pass decoding system that combines the proposed structured deep learning based feature adaptation and structured deep models for adaptive training. The system is evaluated on three noise-robust speech recognition tasks, Aurora4, Chime4 and AMI; the performance of the proposed ASR system is close to human performance on Aurora4, and it obtains 10%-39% relative improvement on the Chime4 and AMI tasks.

In conclusion, this thesis successfully applies structured deep learning to feature adaptation and model adaptation for automatic speech recognition, achieving significant performance improvements on telephony and noise-robust speech recognition tasks. On Aurora4 in particular, it is the state-of-the-art system.
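The basic concatenating structure of speaker-aware training mentioned above can be sketched in a few lines: an utterance-level speaker representation (an i-vector or a DNN-extracted speaker feature) is tiled across time and appended to every acoustic frame before the frames enter the acoustic model. This is an illustrative NumPy sketch, not the thesis implementation; all dimensions and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T frames of 40-dim acoustic features, a 10-dim speaker vector.
T, feat_dim, spk_dim = 5, 40, 10
feats = rng.standard_normal((T, feat_dim))   # per-frame acoustic features
spk_vec = rng.standard_normal((spk_dim,))    # utterance-level speaker representation

# Speaker-aware input: tile the speaker vector over time and concatenate per frame.
aware_input = np.concatenate([feats, np.tile(spk_vec, (T, 1))], axis=1)

print(aware_input.shape)  # (5, 50)
```

Because the speaker vector is constant within an utterance, the network sees the same auxiliary code at every frame, which is what lets it normalize away speaker variability without any per-speaker parameter re-estimation.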
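The cluster adaptive training idea for a single DNN layer, a basis of weight matrices combined by a per-condition interpolation vector, can also be sketched briefly. The basis size, layer dimensions, and interpolation weights below are illustrative assumptions, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a basis of C weight matrices for one small hidden layer.
C, d_in, d_out = 4, 8, 8
basis = rng.standard_normal((C, d_in, d_out))  # canonical basis, trained on all data
lam = np.array([0.4, 0.3, 0.2, 0.1])           # interpolation vector, estimated per condition

# Condition-dependent weight matrix: a linear combination of the basis.
W = np.tensordot(lam, basis, axes=1)           # shape (d_in, d_out)

x = rng.standard_normal((1, d_in))
h = np.maximum(x @ W, 0.0)                     # ReLU activation with the adapted weights
print(W.shape)  # (8, 8)
```

During adaptation only `lam` (C scalars per layer) is re-estimated while the basis stays fixed, which is why this scheme needs far less adaptation data than methods that re-estimate whole weight matrices.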
Keywords/Search Tags:noise-robust speech recognition, structured deep learning, speaker adaptation, context-aware training, multi-task learning, cluster adaptive training