With the development of deep learning theory and improvements in computing hardware, end-to-end approaches have come to play a significant role in speech recognition. An end-to-end model directly learns the mapping between the speech feature sequence and the output text and requires no frame-level alignment annotation; this improves recognition accuracy and further simplifies the modeling process. To address the large parameter counts and high computational complexity of end-to-end speech recognition models, this paper proposes a deep encoder-decoder network based on the Transformer architecture that maintains high recognition accuracy while dramatically reducing the number of parameters and the computational cost, facilitating lightweight deployment of the model. The main work of this paper is as follows:

1. A Transformer encoder network is designed based on a "local-global" attention fusion mechanism. By introducing a learnable parametric mask function into local dense synthesizer attention, a local attention mechanism with an adaptive mask is proposed, which dynamically learns the optimal range of local attention and extracts the short-term local features of the speech signal. By studying how the global self-attention mechanism and the adaptive-mask local attention mechanism affect recognition accuracy under different topologies, an optimal fusion attention mechanism with a "local-global" cascade topology is proposed. Replacing the self-attention mechanism in the Transformer encoder with the proposed fusion attention mechanism yields the improved encoder network.

2. A decoder network based on hierarchical grouped linear transformations is proposed. Using grouped feed-forward networks of different sizes, a lightweight "expand-and-scale" unit based on hierarchical grouped linear transformations is constructed. With a block-by-block scaling strategy, each block of the Transformer decoder is embedded with an "expand-and-scale" unit under a different parameter configuration, yielding a decoder network whose depth and width increase progressively. Combining the Transformer encoder network with the "local-global" attention mechanism and the decoder network based on hierarchical grouped linear transformations gives an improved lightweight Transformer deep encoder-decoder network.

The improved Transformer encoder network proposed in this paper achieves a word error rate of 5.65% on the AISHELL-1 Mandarin Chinese dataset. The improved lightweight Transformer deep encoder-decoder network achieves error rates of 5.99% and 11.06% with 19.9M and 19.6M parameters on the AISHELL-1 dataset and the TED-LIUM 2 English dataset, respectively, outperforming the other methods compared.
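The adaptive-mask local attention in contribution 1 can be sketched in NumPy. This is a minimal illustration, assuming the learnable mask is a soft sigmoid window over token distance whose width `w` is trainable; the thesis's exact parametric mask function may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_mask_attention(q, k, v, w, tau=1.0):
    """Local attention with a differentiable window mask.

    The hard locality constraint |i - j| <= w is relaxed to
    sigmoid((w - |i - j|) / tau), so the effective window width w
    can be learned by gradient descent. Illustrative parameterization,
    not the thesis's exact mask function.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                       # (T, T) scaled dot products
    dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    mask = 1.0 / (1.0 + np.exp(-(w - dist) / tau))      # soft locality mask in (0, 1)
    scores = scores + np.log(mask + 1e-9)               # down-weight distant positions
    return softmax(scores) @ v                          # (T, d) local features
```

With a small `w` the attention weights concentrate near the diagonal; as `w` grows, the mask approaches all-ones and the layer recovers ordinary global attention.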
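The "local-global" cascade topology can likewise be sketched: a local windowed attention pass extracts short-range features, whose output is then fed to a global self-attention pass. This sketch omits linear projections and multi-head splitting (the input serves as query, key, and value), and the hard window here stands in for the adaptive mask.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention with an optional boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block positions outside the mask
    return softmax(scores) @ v

def local_global_cascade(x, window=2):
    """Cascade fusion: local windowed attention first, then global attention.

    Illustrative of the "local-global" cascade topology only; projections,
    multi-head splitting, and residual connections are omitted for brevity.
    """
    T = x.shape[0]
    dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    local_out = attention(x, x, x, mask=dist <= window)  # short-term local features
    return attention(local_out, local_out, local_out)    # global context on top
```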
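The parameter saving behind the grouped "expand-and-scale" unit in contribution 2 is easy to see in a small sketch: replacing each dense linear map in an expand-then-project feed-forward block with a block-diagonal map over `g` feature groups cuts its parameter count by a factor of `g`. The dimensions below are toy values, not the thesis's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, g, expand = 8, 4, 2  # toy model width, number of groups, expansion factor

# Dense "expand-and-scale" FFN (d -> expand*d -> d) parameter count.
dense_params = d * (expand * d) + (expand * d) * d

def grouped_linear(x, weights):
    """Apply one small weight matrix per feature group (block-diagonal map)."""
    groups = np.split(x, len(weights), axis=-1)
    return np.concatenate([gi @ w for gi, w in zip(groups, weights)], axis=-1)

# Grouped version: each of the g groups gets its own (d/g x expand*d/g) matrix.
W1 = [rng.standard_normal((d // g, expand * d // g)) for _ in range(g)]
W2 = [rng.standard_normal((expand * d // g, d // g)) for _ in range(g)]
grouped_params = sum(w.size for w in W1 + W2)

x = rng.standard_normal((3, d))
h = np.maximum(grouped_linear(x, W1), 0.0)  # expansion step with ReLU
y = grouped_linear(h, W2)                   # scaling back to width d
```

Here `grouped_params` is exactly `dense_params // g`, which is what makes embedding such units in every decoder block (under progressively larger configurations, per the block-by-block scaling strategy) affordable.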