Research On End-to-end Tibetan Speech Recognition Based On Deep Learning

Posted on: 2024-04-04
Degree: Master
Type: Thesis
Country: China
Candidate: C Wang
GTID: 2568307085970799
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, fast and accurate machine recognition of speech signals has become a foundation of human-computer interaction. Speech recognition, as one of the key technologies in human-computer interaction, is gradually being applied to minority languages. Tibetan is the main language in the daily life of the Tibetan people and belongs to the Tibetan branch of the Tibeto-Burman group in the Sino-Tibetan language family. The dialect most widely used in Tibet is the Ü-Tsang dialect, represented by Lhasa speech, which has also been the main focus of previous research. By comparison, there has been little research on the Amdo dialect, so this thesis focuses on the Amdo dialect.

As deep learning has advanced, end-to-end speech recognition methods have shown clear advantages over traditional methods: they do not require a pronunciation dictionary, do not require linguistic knowledge, and transfer well to new settings. For mainstream languages such as Chinese and English, end-to-end methods are gradually surpassing traditional methods in recognition performance. This thesis therefore adopts end-to-end methods to study the Amdo dialect of Tibetan. The main research content is as follows:

1. Design three modeling units for Tibetan. Tibetan has a distinctive two-dimensional script: writing proceeds in two directions, horizontally (left to right) and vertically (top to bottom). In response to this characteristic, this thesis designs Tibetan text preprocessing and modeling methods. According to the granularity of the split, the modeling units are divided into three levels: syllable, character block (the Tibetan "ding"), and component. To overcome the incomplete and inconsistent modeling units in current Tibetan speech recognition research, and with the help of the international standard Tibetan character encoding and Tibetan students in the laboratory, this thesis completes a comprehensive design and extraction of syllables, character blocks, and components, and constructs the corresponding dictionaries.

2. Construct a speech corpus of the Amdo dialect of Tibetan. A Tibetan Amdo dialect speech dataset was built from resources shared by supervisors from outside the university, with assistance from Tibetan students in the laboratory, and by inviting Tibetan speakers proficient in the Amdo dialect to record and annotate. Each recording has a sequence number followed by its text label, and the dataset reflects the authenticity and complexity of the Amdo dialect. A Tibetan Amdo dialect text corpus was also built using web crawling, back translation, text augmentation, and other methods, and two language models were trained on different corpora to improve model performance.

3. Implement end-to-end speech recognition for the Amdo dialect of Tibetan. This thesis constructs an end-to-end Tibetan speech recognition model based on connectionist temporal classification (CTC), Transformer, and Bi-Transformer. The encoder of the model uses Transformer, while the decoder uses CTC and Bi-Transformer. After the CTC hard alignment is completed, language models and a rescoring strategy are added to improve recognition accuracy. The language model compensates for the independence assumption of CTC, and the rescoring strategy scores the candidate sequences a second time, making full use of both the speech signal and the text sequence. The experimental results show that adding a language model significantly improves the performance of the speech recognition model, and that among the four decoding methods, the rescoring strategy performs better than the other three. The results also show that with 1000 hours of data, Tibetan syllables work better as modeling units for recognizing Amdo dialect speech streams. With 600 hours of data or less, Tibetan components work better. With 600-800 hours of data, Tibetan character blocks perform about the same as syllables.

4. Build an end-to-end streaming speech recognition model for Tibetan. Building on the speech recognition system for the Amdo dialect, this thesis implements streaming recognition for the Amdo dialect. It uses a dynamic block (chunk) attention mechanism based on the current frame, which recalculates the size of the block attention range when each batch is loaded. This implements the dynamic block mechanism, and streaming decoding is achieved by changing the size of the block attention range. The experimental results show that the four decoding methods perform better when the block attention range is set to 16.

This thesis verifies through experiments the effectiveness of Tibetan syllables, character blocks, and components as modeling units, and addresses the problem of streaming recognition for the Amdo dialect of Tibetan. Its comparison of the three modeling units under different data volumes provides ideas and a reference for future researchers in Tibetan speech recognition.
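
The block (chunk) attention idea in item 4 can be illustrated with a small attention mask. The following is a minimal sketch under stated assumptions, not the thesis's implementation: the function names `chunk_attention_mask` and `sample_chunk_size` are hypothetical, and the rule of drawing a random block size for each batch is only one plausible reading of "recalculating the size of the block attention range during batch loading". Each frame is allowed to attend to all earlier frames and to the frames inside its own block, so a smaller block size means less look-ahead.

```python
import random


def chunk_attention_mask(num_frames, chunk_size):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Assumption: a frame sees every frame up to the end of its own block,
    i.e. full left context plus limited look-ahead inside the block.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    mask = []
    for i in range(num_frames):
        # Index one past the last frame of the block that contains frame i.
        visible_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < visible_end for j in range(num_frames)])
    return mask


def sample_chunk_size(num_frames, rng=random):
    """Hypothetical per-batch rule: draw a block size between 1 and num_frames.

    A draw equal to num_frames gives full (non-streaming) attention.
    """
    return rng.randint(1, num_frames)


if __name__ == "__main__":
    # Visualise a 6-frame utterance with blocks of 2 frames.
    for row in chunk_attention_mask(6, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

At decoding time the block size would be fixed rather than sampled (the abstract reports results for a range of 16), which bounds how many future frames the encoder needs before it can emit output.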
Keywords/Search Tags: speech recognition, Tibetan Amdo dialect, end-to-end model, attention