End-to-End Speech Recognition Research Based on Information Aggregation

Posted on:2024-06-09

Degree:Master

Type:Thesis

Country:China

Candidate:P Liu

GTID:2568306914972489

Subject:Control Science and Engineering

Abstract/Summary:

In recent decades, deep neural networks have been widely used in speech recognition. End-to-end speech recognition algorithms, represented by the Transformer, adopt a sequence-to-sequence approach that integrates the traditional acoustic model and language model into an encoder and a decoder. End-to-end speech recognition simplifies the intermediate steps of the recognition process and improves recognition accuracy. Despite the Transformer's success in speech recognition, its way of modeling speech signals still has shortcomings. Chiefly, the model neglects the dependencies between modeling primitives at different scales, which may lead to a loss of speech information. In addition, autoregressive decoders must generate target characters sequentially, one at a time. In real-time speech interaction scenarios, this slow decoding speed becomes an important factor limiting the performance of speech recognition models. This thesis therefore addresses both the shortcomings in speech signal modeling and the slow decoding speed, with the aim of improving the accuracy and real-time performance of speech recognition. The main work comprises the following two parts:

1. Research on a speech recognition model based on information aggregation. The Transformer-based end-to-end model uses frame-level modeling primitives, which makes it difficult to mine the multi-level information in speech. To solve this problem, this thesis proposes an information aggregation method that gradually expands the modeling granularity of speech primitives. By merging and aggregating frame-level primitives, the method mines the dependencies between modeling primitives of different granularity. At the same time, the multi-level speech information is introduced into the decoding process, which widens the decoding search range. Evaluation results on the Aishell dataset show that the proposed method reduces the word error rate by 0.75% compared with the baseline model.

2. Research on a speech recognition model based on non-autoregressive decoding. To improve the real-time factor of decoding, this thesis proposes a method for constructing a target-text length predictor. The predictor estimates the length of the target text, and this estimate is used to generate a sequence of blank placeholders, equal in length to the target text, as the model input for non-autoregressive decoding. In addition, transfer learning is used to transfer the parameters of the autoregressive model to the non-autoregressive model. The initial input of the decoder is synthesized from the acoustic context by means of the designed target-text predictor. Experiments show that, compared with the autoregressive model, the proposed non-autoregressive decoding method reduces the model's real-time factor by 45.45% and achieves higher recognition accuracy.

To sum up, this thesis proposes two methods for optimizing the end-to-end model: a speech recognition model optimization method based on information aggregation, and a speech recognition model based on non-autoregressive decoding. The two methods target the deficiencies of speech signal modeling and the slow decoding speed, respectively, and both achieve good results. The information aggregation method introduces multi-level speech information to improve the recognition rate of the model. The non-autoregressive speech recognition framework based on length prediction improves the decoding performance of the model in real-time scenarios while maintaining a high recognition level. Experimental results verify the effectiveness of the proposed methods.
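
The two ideas in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say how frames are merged or how the blank input is represented, so mean pooling over adjacent frames, the window width of 2, the `<blank>` symbol, and all function names below are assumptions made purely for illustration.

```python
from typing import List

Frame = List[float]  # one acoustic feature vector (hypothetical representation)


def merge_adjacent(frames: List[Frame], width: int = 2) -> List[Frame]:
    """Average every `width` consecutive frames into one coarser unit.

    Mean pooling is an assumed merging operator, chosen for simplicity.
    A trailing group shorter than `width` is averaged over its own size.
    """
    merged = []
    for start in range(0, len(frames), width):
        group = frames[start:start + width]
        dim = len(group[0])
        merged.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return merged


def build_granularity_levels(frames: List[Frame], num_levels: int = 3,
                             width: int = 2) -> List[List[Frame]]:
    """Return representations from fine (frame level) to coarse granularity."""
    levels = [frames]
    for _ in range(num_levels - 1):
        if len(levels[-1]) <= 1:  # nothing left to merge
            break
        levels.append(merge_adjacent(levels[-1], width))
    return levels


def blank_decoder_input(predicted_length: int, blank: str = "<blank>") -> List[str]:
    """Build a non-autoregressive decoder input of `predicted_length` blanks.

    In a real system the length would come from a trained length predictor;
    here it is simply passed in.
    """
    return [blank] * predicted_length


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
levels = build_granularity_levels(frames)
print([len(level) for level in levels])
print(blank_decoder_input(4))
```

Each level halves the sequence length, so coarser levels summarize longer stretches of speech. Because the blank input has a fixed length, a non-autoregressive decoder can fill in all positions in parallel instead of one character at a time.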

Keywords/Search Tags:

speech recognition, Transformer, information aggregation, non-autoregressive

Related items

1	Research On Speech Recognition Based On Transformer
2	Research On Continuous Speech Recognition System Based On Transformer
3	Research On End-to-End Non-Autoregressive Model-Based Amdo Tibetan Speech Synthesis Technology
4	Research On Chinese Speech Recognition Technology Based On BPE And Transformer
5	Research And Application On Speech Recognition For Complex Scenes
6	Research On Amdo Tibetan Speech Recognition Technology Based On MRDCNN_CTC & Transformer
7	Automatic Speech Recognition And Hotword Enhancement Algorithm Based On Transformer
8	Research On Vietnamese Speech Synthesis Technology Based On End-to-End
9	Transformer tunnels and their application to aggregation in IP networks
10	Mandarin Automatic Speech Recognition Based On Transformer