| Speech is the most direct carrier of information in interpersonal communication, and this makes it an important bridge for human-computer interaction. Since the 1970s, automatic speech recognition has been a hot topic in machine learning, attracting the attention of many scholars. Today, intelligent devices that use automatic speech recognition have widely entered people's lives and brought many conveniences. At this stage, traditional models based on hidden Markov models are still widely used, but such a model is composed of several independent modules, is complicated to use, and has an overly complex structure, so it is difficult to optimize as a whole. End-to-end speech recognition methods avoid these shortcomings by directly mapping acoustic features to text sequences; the model structure is simple and easy to optimize. End-to-end models are therefore gradually becoming the direction of future development. However, end-to-end methods still face many challenges, chief among them how to improve recognition accuracy while maintaining high efficiency. Against this background, this thesis focuses on end-to-end speech recognition technology and conducts research on Mandarin recognition. The main work of this thesis is as follows: (1) The alignment of the Attention model is not constrained to be sequential, and this blind alignment makes training difficult, while the output of a CTC-based model at the current moment is related only to the input at the current moment and lacks contextual relevance. To address both issues, a hybrid CTC/Attention end-to-end model is proposed to give full play to the advantages of both. During training, a multitask learning method is adopted in which CTC serves as an auxiliary task to speed up alignment and decoding. At the same time, the advantage of the context
modeling of the Attention mechanism is brought into play, which allows the model to be adjusted more flexibly. The proposed method is verified on an open-source data set. When the joint training parameter λ=0.3, the hybrid model based on the multi-head attention mechanism performs best. Compared with existing models, it reduces the character error rate to varying degrees, which demonstrates the effectiveness of the proposed method. (2) To solve the problem that the end-to-end model cannot be jointly optimized with a language model, this thesis proposes an end-to-end model that combines a Conformer encoder with the Transducer structure. The Conformer, built by adding a convolution module to the Transformer, serves as the encoder of the model and improves its ability to capture fine local information. Moreover, relative sinusoidal positional encoding is introduced into self-attention so that the model adapts better to speech of different lengths, which further improves its generalization ability. The prediction network in the model addresses the conditional independence assumption of CTC methods and their inability to perform language modeling, achieving joint optimization of acoustic and linguistic information. A series of experiments on open-source data sets shows that the proposed model reduces the character error rate to varying degrees compared with current mainstream models and can meet the requirements of streaming recognition, which demonstrates the advanced nature of the model. (3) As the network deepens and different structures are combined, the number of model parameters increases, producing many redundant parameters that slow down model inference. Therefore, the Conformer-Transducer model is adjusted to improve recognition efficiency. First, a low-rank convolutional front end and a low-rank feedforward module are constructed using low-rank decomposition to remove redundant parameters in the
Conformer encoder. Second, the Beam Search algorithm used for decoding is improved: by setting a threshold, low-confidence paths are pruned and only the valid candidate paths are computed, which reduces the computational complexity. A series of ablation and comparison experiments was carried out on the open-source Mandarin Chinese data sets AISHELL-1 and aidatatang_200zh. The results show that the proposed method greatly improves recognition efficiency while maintaining a low character error rate, demonstrating the generalization ability and advanced nature of the model. |
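
The multitask objective in (1) can be illustrated with a minimal sketch. It assumes the interpolation form commonly used for hybrid CTC/Attention training, in which λ weights the CTC loss and 1 − λ weights the attention loss. The abstract reports only λ=0.3, so the exact form, the function name `joint_loss`, and the loss values below are assumptions made for illustration.

```python
def joint_loss(ctc_loss: float, att_loss: float, lam: float = 0.3) -> float:
    """Interpolate the CTC and attention losses.

    Assumed form: lam * L_ctc + (1 - lam) * L_att. The abstract does not
    state which of the two terms lam multiplies.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * att_loss


# Hypothetical per-batch loss values, used only to show the weighting.
print(joint_loss(ctc_loss=2.0, att_loss=1.0))
```

Under this assumed form, λ=0 reduces to pure attention training and λ=1 to pure CTC training, so λ=0.3 keeps the attention loss dominant while CTC acts as the auxiliary task.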
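
The parameter saving behind the low-rank modules in (3) can be sketched by counting weights. The example assumes that a dense weight matrix of shape d_in × d_out is replaced by two factors of shapes d_in × r and r × d_out. The layer sizes and the rank are hypothetical and are not taken from the thesis, and bias terms are ignored.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Number of weights in a full d_in x d_out matrix (bias ignored)."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Number of weights in the two factors (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


# Hypothetical feedforward layer sizes and rank, chosen only for illustration.
d_in, d_out, rank = 256, 2048, 64
full = dense_params(d_in, d_out)
reduced = low_rank_params(d_in, d_out, rank)
print(full, reduced, reduced / full)
```

The factorization saves parameters only when r < d_in · d_out / (d_in + d_out), so the rank has to be chosen well below the layer width.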
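
The threshold-based pruning of Beam Search described in (3) can be sketched as a single pruning step. The abstract does not specify the pruning criterion, so this sketch assumes that a hypothesis is dropped when its log-probability falls more than `threshold` below the best hypothesis. The function name `prune_beam`, the candidate strings, and the scores are hypothetical.

```python
def prune_beam(candidates, beam_size, threshold):
    """Keep at most beam_size hypotheses that score close to the best one.

    candidates: list of (text, log_prob) pairs.
    threshold:  maximum allowed log-probability gap to the best hypothesis
                (assumed criterion, not taken from the thesis).
    """
    if not candidates:
        return []
    best = max(score for _, score in candidates)
    # Drop low-confidence paths before ranking, so they are never expanded.
    kept = [(text, score) for text, score in candidates
            if score >= best - threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:beam_size]


# Hypothetical partial hypotheses with their log-probabilities.
hyps = [("你好", -0.2), ("你号", -1.5), ("尼好", -4.0), ("拟好", -6.5)]
print(prune_beam(hyps, beam_size=3, threshold=2.0))
```

With these values only two hypotheses survive even though the beam size is 3, which is where the saving comes from: pruned paths are not extended at later decoding steps.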