
Research On Low-Resource Speech Recognition Based On The Transfer Learning And Fusion Of Language Models

Posted on: 2024-05-22    Degree: Master    Type: Thesis
Country: China    Candidate: S Y Li    Full Text: PDF
GTID: 2568307055998049    Subject: Computer technology
Abstract/Summary:
In recent years, with the continuous advance of deep learning, speech recognition technology has developed rapidly. Two frameworks currently dominate acoustic modeling in speech recognition: the hybrid architecture and the end-to-end architecture. The end-to-end architecture trains and optimizes a single network against one objective function, avoiding the modular design and independence assumptions of the hybrid architecture, and offers joint optimization and ease of deployment. However, the end-to-end architecture also faces two problems: it requires a large amount of annotated data, and it is sensitive to noise and dialect variation. These problems greatly limit progress in speech recognition for low-resource languages, which lack sufficient annotated speech data. This thesis targets these two key issues of the end-to-end architecture and aims to improve recognition performance for low-resource languages, mainly through transfer learning and language model fusion.

Taking the Amdo dialect of Tibetan as the low-resource research object, this thesis studies three aspects: modeling unit selection under the end-to-end architecture, transfer learning optimization, and language model fusion optimization. The specific work is as follows:

1. Selecting modeling units under the end-to-end architecture is a key issue. This thesis proposes improving the modeling units of low-resource end-to-end speech recognition with the byte pair encoding (BPE) algorithm. Traditional modeling units are based on Tibetan syllables or Tibetan letters, but with limited training data they suffer from many out-of-vocabulary tokens or from information loss. This thesis therefore uses the byte pair encoding algorithm to learn modeling units automatically, by iteratively merging the most frequent adjacent Tibetan letters in the text. Experimental results show that BPE modeling units capture more acoustic features while being far fewer in number than traditional modeling units, which significantly improves recognition performance; the best result is a relative 26.81% improvement over syllable-level modeling.

2. This thesis proposes an end-to-end speech recognition method based on self-supervised feature extraction and transfer learning, to alleviate the poor performance caused by insufficient training data in low-resource speech recognition. At the feature extraction level, a HuBERT model trained on Mandarin is used to explore how self-supervised feature extraction benefits low-resource speech recognition. At the model level, following the idea of transfer learning, the parameters of a Mandarin-trained model are used to initialize the parameters of the Amdo dialect model. The results show that self-supervised feature extraction and pre-trained parameter initialization achieve relative performance improvements of 9.9% and 11.9%, respectively, over the baseline system.

3. In end-to-end speech recognition, language model fusion has proven effective. The end-to-end model implicitly learns language information, and shallow fusion can incorporate an additional language model, but that method lacks mathematical grounding. Speech recognition usually assumes that the source domain (the training scenario) and the target domain (the test scenario) share the same acoustic model, and the end-to-end model can be viewed as a combination of an acoustic model and a language model. Building on a Bayesian formulation, this thesis therefore explores the density ratio method for fusing language models in low-resource speech recognition: in the target domain, a hypothesis's score is the source-domain end-to-end model score, minus the internal language model score, plus the external (target-domain) language model score. Experimental results show that the proposed language model fusion method significantly improves the recognition performance of low-resource speech recognition.
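The byte pair encoding procedure described in point 1 — repeatedly merging the most frequent adjacent symbol pair into a new modeling unit — can be sketched in a few lines of Python. This is a toy illustration, not the thesis's implementation (production systems typically use a toolkit such as SentencePiece), and the identifiers are illustrative:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy sketch).
    Each word starts as a tuple of single letters; frequent adjacent
    pairs are merged into larger modeling units."""
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges, vocab
```

The same loop applies to Tibetan letters; only the input alphabet changes.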
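The transfer-learning initialization in point 2 amounts to copying every Mandarin-model parameter whose name and shape match into the Amdo model, while vocabulary-dependent layers (the output projection, whose size depends on the modeling units) keep a fresh initialization. A minimal sketch with hypothetical parameter dictionaries — the names, shapes, and vocabulary sizes are illustrative, not from the thesis:

```python
def transfer_init(src_params, tgt_params):
    """Copy every source parameter whose name and shape match the target;
    layers tied to the source vocabulary keep their fresh target init.
    Parameters are modeled as name -> (shape, values)."""
    copied = []
    for name, (shape, values) in src_params.items():
        if name in tgt_params and tgt_params[name][0] == shape:
            tgt_params[name] = (shape, values)
            copied.append(name)
    return copied

# Shared encoder layers have identical shapes; the output projection does
# not, because the Mandarin and Amdo-dialect vocabularies differ in size.
mandarin = {
    "encoder.weight": ((256, 80), "trained-on-mandarin"),
    "output.weight": ((5000, 256), "mandarin-vocab"),
}
amdo = {
    "encoder.weight": ((256, 80), "random-init"),
    "output.weight": ((500, 256), "random-init"),
}
copied = transfer_init(mandarin, amdo)
```

In a real framework the same matching is done over the pre-trained model's parameter state, with the mismatched output layer left to train from scratch on the Amdo data.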
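The density ratio scoring rule in point 3 can be written down directly: a hypothesis's fused score is the end-to-end model's log score, minus a weighted internal (source-domain) language model log score, plus a weighted external (target-domain) language model log score. A minimal sketch — the weights and example numbers are illustrative, not the thesis's tuned values:

```python
def density_ratio_score(log_p_e2e, log_p_internal_lm, log_p_external_lm,
                        lm_weight_int=0.3, lm_weight_ext=0.5):
    """Density-ratio fusion score for one hypothesis during beam search.
    The end-to-end score is treated as an implicit product of an acoustic
    model and an internal LM, so the internal LM is subtracted before the
    external LM is added."""
    return (log_p_e2e
            - lm_weight_int * log_p_internal_lm
            + lm_weight_ext * log_p_external_lm)

def rescore(hypotheses):
    """Pick the hypothesis with the highest fused score.
    Each hypothesis: (text, log_p_e2e, log_p_int_lm, log_p_ext_lm)."""
    return max(hypotheses,
               key=lambda h: density_ratio_score(h[1], h[2], h[3]))[0]
```

A hypothesis that the source-domain model over-scores (because its internal LM favors it) can thus be overtaken by one the target-domain language model prefers.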
Keywords/Search Tags: Self-supervised learning, Byte Pair Encoding (BPE), Transfer learning, Language model fusion, Tibetan