With the rapid development of the internet industry, speech recognition technology is used ever more widely in daily life. Thanks to its simplicity and effectiveness, the attention-based encoder-decoder end-to-end ASR model (AED model) has been widely adopted and has become a focus of research. However, AED models are inherently limited in how fully they can exploit an external language model (LM) trained on much larger text-only corpora. When fused with an external LM, the relative accuracy improvement of an AED model is usually as low as 10%, whereas a traditional ASR model usually gains 30%-40%. This is not only because the AED model is already much more accurate than the traditional ASR model, but also because the AED model inevitably learns a biased internal language model (ILM) during training. In theory, better recognition accuracy can be achieved by removing the influence of the ILM when fusing with an external LM. The difficulty is that the ILM cannot be estimated easily, because it is hidden inside the AED model.

Many methods have been proposed to estimate this implicit ILM. One of the most effective, called Zero-out, was proposed by Microsoft. However, it can only estimate the ILM of AED models with a BLSTM encoder and reduce the WER of the fused model; it cannot be applied to all types of AED models because of a mismatch problem. In addition, the estimation and subtraction operations require far more computational resources during inference than traditional shallow fusion. Thus, despite the significant accuracy improvement, such methods are not widely used in industry. To address these issues, the main research contents and contributions of this thesis are as follows:

1. We find that the Zero-out method is not suitable for certain types of AED models due to the mismatch problem, and we propose two training-based methods to solve it, so that the proposed methods can be applied to any type of AED model (the general fusion formulation is sketched below). Experimental results demonstrate that the proposed methods consistently outperform the Zero-out method, reducing WER by up to 33% relative on the LibriSpeech test set.

2. To reduce the computational cost of inference, we propose a training method that eliminates the linguistic bias directly during training, instead of subtracting it during inference. This is achieved through adversarial learning (one possible realization is sketched below). A model trained with the proposed method carries less linguistic bias and achieves comparable accuracy with far fewer computational resources.
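For context on contribution 1, the ILM-estimation literature commonly scores each hypothesis during fused decoding as a log-linear combination in which the estimated ILM is subtracted; the notation below is illustrative and not taken verbatim from this thesis:

\[
\hat{y} = \arg\max_{y} \; \log P_{\mathrm{AED}}(y \mid x) \;+\; \lambda_{\mathrm{ext}} \log P_{\mathrm{ELM}}(y) \;-\; \lambda_{\mathrm{ilm}} \log P_{\mathrm{ILM}}(y)
\]

Here $x$ is the acoustic input, $P_{\mathrm{ELM}}$ is the external LM, and $P_{\mathrm{ILM}}$ is the estimated internal LM; the weights $\lambda_{\mathrm{ext}}$ and $\lambda_{\mathrm{ilm}}$ are tuned on a development set. The extra ILM forward pass and subtraction are the inference-time overhead the abstract refers to.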
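The abstract does not specify how the adversarial learning in contribution 2 is implemented. The sketch below shows one standard way such an objective can be realized in PyTorch: a gradient reversal layer feeding an auxiliary LM head, so that the decoder is penalized for carrying purely linguistic information. All names here (GradReverse, adversarial_lm_loss, lm_head) are hypothetical illustrations, not the thesis's actual method.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; gradient scaled by -lam in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient before it reaches the decoder.
        return -ctx.lam * grad_output, None


def adversarial_lm_loss(decoder_states, targets, lm_head, lam=0.5):
    """Auxiliary next-token prediction loss with a reversed gradient.

    The auxiliary head tries to predict the next token from decoder states
    alone (no acoustic evidence). Because its gradient is reversed, the
    decoder is trained to encode *less* purely linguistic information,
    which is one way to suppress the internal LM bias during training.
    """
    reversed_states = GradReverse.apply(decoder_states, lam)
    logits = lm_head(reversed_states)   # (batch, time, vocab)
    return nn.functional.cross_entropy(
        logits.transpose(1, 2),         # (batch, vocab, time)
        targets,                        # (batch, time)
        ignore_index=-100,
    )
```

In practice such a loss would be added with a small weight to the usual cross-entropy ASR objective, so decoding itself is unchanged and no ILM estimation or subtraction is needed at inference time.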