
Research On Conversational Speech Recognition Technology In Low-Resource Scenario

Posted on: 2024-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: G L Zhong
GTID: 2568306932455774
Subject: Information and Communication Engineering
Abstract/Summary:
Conversational speech recognition, which converts spoken conversations into text, is widely used in meetings, customer service, and other scenarios. However, non-standard pronunciation and colloquial expressions make it difficult for general speech recognition models to perform well on conversational speech. In addition, collecting and annotating real conversational speech is difficult and costly, so improving recognition performance with limited training data remains an open problem in speech technology research. This dissertation focuses on that problem and builds an end-to-end speech recognition system on top of a pretrained model to mitigate overfitting. The work has two parts.

First, to make full use of the limited transcribed data, we propose a training method based on long context. Traditional speech recognition models recognize individual utterances in isolation, disregarding the contextual information between them and generalizing poorly to long-form speech. To address this, we introduce context concatenation data augmentation, which expands the context of each utterance and enlarges the training set. We also propose context-aware training, in which the text of preceding sentences is encoded and integrated into the model through start token fusion and cross-sentence attention fusion. Experiments show significant improvements in recognition accuracy, and combining the two methods improves performance further.

Second, because the scarcity of transcribed data limits model performance, we study how to leverage external text data. We first propose a method for obtaining conversational-style text data from the web. We then propose a joint text-task training method that optimizes the decoder with an auxiliary text denoising task, using a text encoder to construct the input to the decoder's cross-attention layer. Furthermore, we propose synthesized speech data augmentation, which uses a text-to-speech model to generate speech for the external text and mixes it with real data during training to optimize the whole model. Experimental results demonstrate that both methods improve recognition performance, with synthesized speech data augmentation yielding the largest improvement.
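
The abstract does not specify how context concatenation is implemented. As a rough illustration only, the sketch below assumes it means joining each utterance with up to a fixed number of preceding utterances from the same conversation to form additional, longer training samples. The `Utterance` type, the `concat_context` function, the `max_context` parameter, and the choice to join transcripts with a space are all hypothetical and not taken from the thesis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    audio: List[float]  # waveform samples (placeholder representation)
    text: str           # reference transcript


def concat_context(conversation: List[Utterance], max_context: int = 2) -> List[Utterance]:
    """Build extra training samples by prepending preceding utterances.

    For each utterance, one new sample is created per context length
    k = 1..max_context, covering the k preceding utterances plus the
    utterance itself. This is an assumed reading of the technique.
    """
    augmented: List[Utterance] = []
    for i in range(len(conversation)):
        for k in range(1, max_context + 1):
            start = i - k
            if start < 0:
                break  # not enough history for this context length
            window = conversation[start:i + 1]
            # Concatenate the audio and the transcripts in time order.
            audio = [sample for u in window for sample in u.audio]
            text = " ".join(u.text for u in window)
            augmented.append(Utterance(audio=audio, text=text))
    return augmented


conversation = [
    Utterance([0.1], "hello"),
    Utterance([0.2], "how are you"),
    Utterance([0.3], "fine thanks"),
]
extra = concat_context(conversation, max_context=2)
```

In this sketch the augmented samples would be added to the original single-utterance samples rather than replacing them, so the model sees both short and long inputs during training.
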
Keywords/Search Tags:Low-resource, Conversational speech recognition, Long context, External text, Data augmentation