| In recent decades, deep neural networks have been widely used in speech recognition. End-to-end speech recognition algorithms, represented by the Transformer, adopt a sequence-to-sequence architecture that integrates the traditional acoustic model and language model into an encoder and a decoder. End-to-end speech recognition simplifies the intermediate steps of the recognition pipeline and improves recognition accuracy. Despite the success of the Transformer in speech recognition, its way of modeling speech signals still has shortcomings. Chiefly, the model neglects the dependencies between modeling units (primitives) at different scales, which may lead to a loss of speech information. In addition, autoregressive decoders must generate target characters sequentially, one at a time. In real-time speech interaction scenarios, this slow decoding speed becomes an important factor limiting the performance of speech recognition models. This paper therefore addresses both the shortcomings in speech signal modeling and the slow decoding speed, with the aim of improving the accuracy and real-time performance of speech recognition. The main work of this paper covers the following two aspects: 1. Research on a speech recognition model based on information convergence. The Transformer-based end-to-end model uses frame-level modeling primitives, which makes it difficult to mine the multi-level information in speech. To solve this problem, this paper proposes an information aggregation method that gradually expands the modeling granularity of speech primitives. By merging and aggregating frame-level primitives, the method mines the interdependencies between modeling primitives of different granularities. At the same time, the multi-level information of speech is introduced into the decoding process to expand the decoding search range. Evaluation results on the Aishell dataset show that the proposed method reduces
the word error rate by 0.75% compared with the baseline model. 2. Research on a speech recognition model based on non-autoregressive decoding. To improve the real-time factor of decoding, this paper proposes a method for constructing a target text length predictor. The predictor estimates the length of the target text, and this information is used to generate blank cells of the same length as the target text, which serve as the model input in non-autoregressive decoding. In addition, transfer learning is used to transfer the parameters of the autoregressive model to the non-autoregressive model. A target text layout predictor is designed so that the initial decoder input is synthesized according to the acoustic context. Experiments show that, compared with the autoregressive model, the proposed non-autoregressive decoding method reduces the real-time factor of the model by 45.45% and achieves higher recognition accuracy. In summary, this paper proposes two methods for optimizing the end-to-end model: a speech recognition model optimization method based on information convergence and a speech recognition model based on non-autoregressive decoding. The two methods address the deficiencies of speech signal modeling and the slow decoding speed, respectively, and both achieve good results. The information convergence method introduces multi-level speech information to improve the recognition rate of the model. The non-autoregressive speech recognition framework based on length prediction improves the decoding performance of the model in real-time scenarios while maintaining a high recognition level. Experimental results verify the effectiveness of the proposed methods. |
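
The two ideas summarized above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that adjacent frames are merged by simple averaging (the abstract does not name the merge operator), that the blank decoder input is a list of placeholder tokens, and that the predicted length is supplied as a plain integer rather than produced by a trained predictor. The names `merge_adjacent_frames`, `build_pyramid`, `blank_decoder_input`, `real_time_factor`, and the `<mask>` token are all hypothetical.

```python
from typing import List, Sequence

Frame = List[float]   # one acoustic feature vector
MASK = "<mask>"       # hypothetical placeholder token for blank decoder input


def merge_adjacent_frames(frames: Sequence[Frame], width: int = 2) -> List[Frame]:
    """Average every `width` consecutive frames into one coarser unit.

    Averaging is an assumption made for illustration only.
    A trailing group shorter than `width` is averaged over its own size.
    """
    if width < 1:
        raise ValueError("width must be >= 1")
    merged: List[Frame] = []
    for start in range(0, len(frames), width):
        group = frames[start:start + width]
        dim = len(group[0])
        merged.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return merged


def build_pyramid(frames: Sequence[Frame], levels: int = 3, width: int = 2) -> List[List[Frame]]:
    """Return representations from fine (frame level) to coarse granularity."""
    pyramid: List[List[Frame]] = [list(frames)]
    for _ in range(levels - 1):
        if len(pyramid[-1]) <= 1:
            break  # nothing left to merge
        pyramid.append(merge_adjacent_frames(pyramid[-1], width))
    return pyramid


def blank_decoder_input(predicted_length: int) -> List[str]:
    """Build the non-autoregressive decoder input: one blank per target position."""
    return [MASK] * max(predicted_length, 0)


def real_time_factor(decode_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return decode_seconds / audio_seconds


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    for level, units in enumerate(build_pyramid(frames)):
        print(f"level {level}: {len(units)} units")
    print(blank_decoder_input(4))
```

Because every position of the blank input exists before decoding starts, a non-autoregressive decoder can fill all positions in one parallel pass, which is where the speed advantage over step-by-step autoregressive generation comes from.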