
Research On Speech Recognition Based On Self-supervised Model

Posted on: 2024-07-25
Degree: Master
Type: Thesis
Country: China
Candidate: S R Li
Full Text: PDF
GTID: 2568307055497964
Subject: Computer technology
Abstract/Summary:
Owing to the advent of deep neural networks, significant advancements have been achieved in automatic speech recognition. Present-day state-of-the-art systems approach human-level accuracy in certain contexts. Nevertheless, these systems require vast quantities of labeled data for training, which limits the applicability of speech recognition technology to low-resource languages. Recently, research on self-supervised speech representation models has gathered momentum: such models are pre-trained solely on audio data, without corresponding text labels, and subsequently exhibit remarkable performance on a wide range of downstream tasks. Within this framework, this study investigates speech recognition tasks based on the currently prevalent self-supervised models wav2vec 2.0 and HuBERT. The primary contributions and innovations of this work are as follows:

First, a self-supervised speech recognition approach is employed. The method comprises two stages. The first stage pre-trains the self-supervised model, using a contrastive loss function to learn high-level speech features from large amounts of unlabeled audio data. The second stage fine-tunes the self-supervised model, using it as an encoder followed by a CTC decoder and employing a small amount of labeled data. Experiments based on both the wav2vec 2.0 and HuBERT self-supervised models were conducted on the TIMIT dataset and a Yongning Mosuo corpus. The results show that, compared with conventional supervised speech recognition techniques, fine-tuning a self-supervised model achieves the lowest word error rate of 15.3% and requires a shorter training time.

Second, a baseline system for unsupervised speech recognition is constructed. The process has two parts. The first is audio and text preprocessing: a self-supervised pre-trained model extracts high-level speech features in place of traditional acoustic features, which are then clustered and dimensionality-reduced, while the text is converted into phoneme sequences and one-hot vectors. The second is unsupervised training: the preprocessed speech features are fed into a generator, which outputs a sequence of pseudo-phoneme vectors; these are fed into a discriminator along with the unpaired text vectors. The goal of unsupervised training is to make it difficult for the discriminator to distinguish whether its input comes from the pseudo-phoneme vectors or from real text vectors. Experimental results show that unsupervised speech recognition achieves a phoneme error rate of 20.06% on TIMIT. On the Yongning Mosuo corpus, however, owing to the scarcity of text data and the limitations of the self-supervised model, recognition performance is not ideal.

Finally, the unsupervised speech recognition system is improved. First, a speaker normalization method for the self-supervised model is proposed: qualitative and quantitative analyses of the high-level speech features extracted by the self-supervised model are conducted, and the influence of speaker information on the speech recognition task is removed. Second, a reconstruction loss is introduced so that the generator's output can be reconstructed back into the original input audio, making full use of the original audio information. Experiments on TIMIT show that, compared with the baseline system, the phoneme error rate drops by more than 5% and training is more stable.
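The fine-tuning stage pairs the pre-trained encoder with a CTC decoder. At inference time, CTC's standard best-path decoding collapses repeated frame-level labels and removes blanks; a minimal sketch of that step (function and symbol names are illustrative, not taken from the thesis):

```python
BLANK = 0  # index of the CTC blank symbol (an assumption for this sketch)

def ctc_greedy_decode(frame_labels):
    """Standard CTC best-path decoding: collapse consecutive repeats, drop blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:  # keep only new, non-blank labels
            out.append(lab)
        prev = lab
    return out

# Frame-level argmax sequence "1 1 - 2 - 2 2" decodes to the label sequence [1, 2, 2]
print(ctc_greedy_decode([1, 1, 0, 2, 0, 2, 2]))  # [1, 2, 2]
```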
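The adversarial objective of the unsupervised baseline can be sketched with the standard GAN losses: the discriminator is trained to score real text vectors high and generated pseudo-phoneme vectors low, while the generator is trained to fool it. This is an illustrative scalar form, not the thesis's exact implementation:

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(real text) toward 1 and D(pseudo-phonemes) toward 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """The generator tries to make its pseudo-phoneme sequence look real to D."""
    return -math.log(d_fake)

# A well-trained discriminator (real=0.9, fake=0.1) incurs a lower loss
# than an undecided one (0.5, 0.5).
print(discriminator_loss(0.9, 0.1) < discriminator_loss(0.5, 0.5))  # True
```

Training alternates between the two losses until the discriminator can no longer tell pseudo-phoneme vectors from real text vectors, which is exactly the stopping criterion described above.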
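The abstract does not specify how speaker information is removed from the self-supervised features. One common way to suppress speaker variation is per-speaker (here simplified to per-utterance) mean-variance normalization of the feature frames; the sketch below assumes that interpretation, and the function name is hypothetical:

```python
def speaker_normalize(frames):
    """Mean-variance normalize a list of feature frames (list of equal-length lists).

    Subtracting the per-dimension mean and dividing by the per-dimension standard
    deviation removes constant, speaker-dependent offsets and scales.
    """
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    std = [v ** 0.5 if v > 0 else 1.0 for v in var]  # guard against zero variance
    return [[(f[d] - mean[d]) / std[d] for d in range(dim)] for f in frames]

# Two 2-dimensional frames: after normalization each dimension has mean 0, variance 1.
print(speaker_normalize([[1.0, 2.0], [3.0, 4.0]]))  # [[-1.0, -1.0], [1.0, 1.0]]
```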
Keywords/Search Tags:speech recognition, self-supervised model, model fine-tuning, unsupervised learning