Research On Conversational Speech Recognition Technology In Low-Resource Scenario

Posted on: 2024-09-11

Degree: Master

Type: Thesis

Country: China

Candidate: G L Zhong

GTID: 2568306932455774

Subject: Information and Communication Engineering

Abstract/Summary:

Conversational speech recognition, which converts spoken conversations into text, has widespread applications in meetings, customer service, and other scenarios. However, non-standard pronunciation and colloquial expressions make it difficult for general-purpose speech recognition models to perform well in conversational scenarios. In addition, collecting and annotating real conversational speech is difficult and costly, so improving recognition performance with limited training data remains a challenge in speech technology research. This dissertation focuses on that problem and builds an end-to-end speech recognition system based on a pretrained model to address overfitting. The work comprises two parts.

First, to make full use of the limited transcribed data, we propose a training method based on long context. Traditional speech recognition models recognize individual utterances in isolation, disregarding the contextual information between them and generalizing poorly to long-form speech. To address this, we introduce context concatenation data augmentation, which expands the context of each utterance and augments the training set. We also propose context-aware training, in which the text of preceding sentences is encoded and integrated into the model through start token fusion and cross-sentence attention fusion. Experiments show significant improvements in recognition accuracy, and combining the two methods further improves performance.

Second, because the scarcity of transcribed data limits model performance, we study how to leverage external text data. We first propose a method for obtaining conversational-style text data from the web. We then propose a joint text-task training method, which optimizes the decoder with an auxiliary text denoising task and uses a text encoder to construct the input to the decoder’s cross-attention layer. Furthermore, we propose synthesized speech data augmentation, which uses a text-to-speech model to generate speech for the external text and mixes it with real data during training to optimize the whole model. Experimental results demonstrate that both methods improve recognition performance, with synthesized speech data augmentation yielding the largest improvement.
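The abstract does not spell out how context concatenation data augmentation is implemented. As a rough illustration only, the idea of expanding an utterance with its preceding context could be sketched as follows. The `Utterance` type, the `concat_context` function, the `max_prev` and `sep` parameters, and the assumption that utterances arrive in time order grouped by conversation are all hypothetical and not taken from the thesis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    conv_id: str          # conversation the utterance belongs to (assumed field)
    samples: List[float]  # waveform samples; a stand-in for real audio
    text: str             # reference transcript


def concat_context(utts: List[Utterance], max_prev: int = 2,
                   sep: str = " ") -> List[Utterance]:
    """Return the original utterances plus copies prefixed with up to
    `max_prev` preceding utterances from the same conversation.

    Assumes `utts` is sorted by time and grouped by conversation.
    """
    out = list(utts)  # keep every original single-utterance sample
    for i, cur in enumerate(utts):
        for k in range(1, max_prev + 1):
            start = i - k
            if start < 0:
                break  # no earlier utterance available
            window = utts[start:i + 1]
            if any(u.conv_id != cur.conv_id for u in window):
                break  # do not concatenate across conversations
            out.append(Utterance(
                conv_id=cur.conv_id,
                # join audio and transcripts in the same order
                samples=[s for u in window for s in u.samples],
                text=sep.join(u.text for u in window),
            ))
    return out
```

Under these assumptions, each utterance contributes its original form plus longer variants, so the training set grows while every new sample remains a contiguous stretch of one conversation. For languages written without spaces, `sep` could be set to an empty string.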

Keywords/Search Tags:

Low-resource, Conversational speech recognition, Long context, External text, Data augmentation

Related items

1	Research On Application Of Data Augmentation Based On Different Speech Habits In Speech Recognition In Telephone Scene
2	Research On Data Augmentation Technology For Speech Recognition Application
3	Design And Implementation Of Handwritten Chinese Character Recognition Platform Based On Text Recognition
4	Research And Application Of Text Error Detection And Correction After Speech Recognition
5	Text-to-Image Synthesis Models In Low-resource Scenarios
6	Research On Speech Recognition Technology In Low Resource Environment
7	Research On Speaker Recognition In Conversational Speech
8	Speech Emotion Recognition With Deep Learning Techniques And Data Augmentation
9	Key Technology Study Of Sociable Conversational Recommendation System
10	Research Of Long Speech And Text Alignment