
End-to-end Nanchang Dialectal Speech Recognition Based On Deep Learning

Posted on: 2024-02-18    Degree: Master    Type: Thesis
Country: China    Candidate: G Jiang    Full Text: PDF
GTID: 2568307100980019    Subject: Information and Communication Engineering
Abstract/Summary:
In recent years, with the development of deep learning, the recognition accuracy of speech recognition systems for languages such as English and Chinese has improved greatly, and Mandarin speech recognition has reached a high level that largely meets the needs of daily communication. In China, however, dialects are widely distributed; because there are many dialect varieties and only a small amount of data for each, it is difficult to build a speech recognition model for a dialect. More effective methods are therefore needed for dialectal speech recognition. This thesis focuses on how to use limited dialect speech resources to improve the performance of a Nanchang dialect speech recognition system. The main work is as follows:

Firstly, the characteristics of the Nanchang dialect are analyzed and a Nanchang dialect dataset is constructed. Six local volunteers from Nanchang were recruited to record specified texts in the Nanchang dialect. The recordings were edited, segmented, and corrected, yielding 13,988 utterances totaling 18.2 hours of Nanchang dialect speech.

Secondly, various end-to-end speech recognition models are studied in depth. Drawing on this work, an end-to-end model based on the RNN-T structure, called Conformer-Transducer, is built: the Conformer serves as the acoustic encoder of the RNN-T structure and a BLSTM as its label encoder. The Conformer performs well in many speech recognition tasks because it combines the advantages of the Transformer, which is good at capturing global context, and of convolutional neural networks, which are good at extracting local features. Comparative experiments varying the main parameters of the model, such as the number of Conformer encoder layers, the number of attention heads in the Conformer module, and the size of its convolution kernel, were conducted to select the best configuration of the Conformer-Transducer. Different end-to-end speech recognition models were then trained on the AISHELL-1 and aidatatang_200zh datasets, and the comparison showed that the Conformer-Transducer achieved the best results on the Mandarin datasets.

Finally, transfer learning is applied to train the end-to-end Nanchang dialect speech recognition model. The collected dialect data is first speed-perturbed to expand the training set, and different fine-tuning schemes are then compared; the best recognition performance is obtained by fine-tuning each module of the model, giving a character error rate of 12.6% on the Nanchang dialect test set.
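As a rough illustration of the model structure described above, the sketch below assembles a Conformer-Transducer from PyTorch and torchaudio building blocks (torchaudio.models.Conformer for the acoustic encoder, an nn.LSTM BLSTM as the label encoder, and a simple additive joint network). The hyperparameters (number of encoder layers, attention heads, convolution kernel size, vocabulary size) are placeholder assumptions, not the values selected in the thesis.

```python
import torch
import torch.nn as nn
import torchaudio


class ConformerTransducer(nn.Module):
    """Conformer acoustic encoder + BLSTM label encoder + joint network (RNN-T style)."""

    def __init__(self, num_mels=80, vocab_size=4000, enc_dim=256,
                 num_layers=6, num_heads=4, conv_kernel=31, pred_dim=320):
        super().__init__()
        # Acoustic encoder: filterbank features -> stack of Conformer blocks.
        self.front = nn.Linear(num_mels, enc_dim)
        self.encoder = torchaudio.models.Conformer(
            input_dim=enc_dim, num_heads=num_heads, ffn_dim=4 * enc_dim,
            num_layers=num_layers, depthwise_conv_kernel_size=conv_kernel)
        # Label encoder: BLSTM over the (blank-prepended) character history.
        self.embed = nn.Embedding(vocab_size + 1, pred_dim)
        self.label_enc = nn.LSTM(pred_dim, pred_dim // 2, batch_first=True,
                                 bidirectional=True)
        self.label_proj = nn.Linear(pred_dim, enc_dim)
        # Joint network: combines both streams, predicts characters + blank.
        self.joint = nn.Sequential(nn.Tanh(), nn.Linear(enc_dim, vocab_size + 1))

    def forward(self, feats, feat_lens, tokens):
        enc_out, enc_lens = self.encoder(self.front(feats), feat_lens)  # (B, T, D)
        lab_out, _ = self.label_enc(self.embed(tokens))                 # (B, U, pred_dim)
        lab_out = self.label_proj(lab_out)                              # (B, U, D)
        # Broadcast-add to form the (B, T, U, vocab+1) RNN-T output lattice.
        logits = self.joint(enc_out.unsqueeze(2) + lab_out.unsqueeze(1))
        return logits, enc_lens


# Example forward pass with random inputs; the resulting lattice can be trained
# with an RNN-T loss such as torchaudio.functional.rnnt_loss.
feats = torch.randn(2, 200, 80)               # (batch, frames, mel bins)
feat_lens = torch.tensor([200, 160])
tokens = torch.randint(0, 4000, (2, 12))      # blank-prepended label history
logits, out_lens = ConformerTransducer()(feats, feat_lens, tokens)
```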
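The transfer-learning step (speed perturbation to expand the small dialect corpus, then module-wise fine-tuning from a Mandarin-pretrained model) could look roughly like the following sketch. The 0.9x/1.0x/1.1x speed factors, the file and checkpoint names, and the particular modules chosen for unfreezing are illustrative assumptions, not the exact schemes compared in the thesis.

```python
import torch
import torchaudio


def speed_perturb(waveform, sample_rate, factor):
    """Speed-perturb one utterance (sox 'speed' then resample back) for data expansion."""
    effects = [["speed", f"{factor}"], ["rate", f"{sample_rate}"]]
    perturbed, _ = torchaudio.sox_effects.apply_effects_tensor(
        waveform, sample_rate, effects)
    return perturbed


def finetune_optimizer(model, modules_to_tune=("encoder", "joint"), lr=1e-4):
    """Freeze all parameters, then unfreeze only the modules chosen for fine-tuning."""
    for p in model.parameters():
        p.requires_grad = False
    params = []
    for name in modules_to_tune:
        for p in getattr(model, name).parameters():
            p.requires_grad = True
            params.append(p)
    return torch.optim.Adam(params, lr=lr)


# Augment each Nanchang utterance at 0.9x/1.0x/1.1x speed, then fine-tune selected
# modules of a model initialized from a (hypothetical) Mandarin-pretrained checkpoint.
wav, sr = torchaudio.load("nanchang_utt_0001.wav")            # hypothetical file name
augmented = [speed_perturb(wav, sr, f) for f in (0.9, 1.0, 1.1)]

model = ConformerTransducer()                                  # class from the sketch above
model.load_state_dict(torch.load("mandarin_pretrained.pt"))    # hypothetical checkpoint
optimizer = finetune_optimizer(model, modules_to_tune=("encoder", "joint"))
```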
Keywords/Search Tags:dialectal speech recognition, low resource, Conformer, RNN-T, fine-tuning