With the development of deep learning theory and improvements in computing hardware, end-to-end approaches have come to play a significant role in speech recognition. An end-to-end model directly learns the mapping between the speech feature sequence and the output text and requires no frame-level alignment annotation; this improves recognition accuracy and further simplifies the modeling process. To address the large parameter counts and high computational complexity of end-to-end speech recognition models, this paper proposes a deep encoder-decoder network based on the Transformer architecture that maintains high recognition accuracy while dramatically reducing the number of parameters and the computational cost, facilitating lightweight deployment of the model. The main work of this paper is as follows:

1. A Transformer encoder network is designed based on a "local-global" attention fusion mechanism. By introducing a learnable parametric mask function into local dense synthesizer attention, a local attention mechanism with an adaptive mask is proposed, which dynamically learns the optimal range of local attention and extracts the short-term local features of the speech signal. By studying how the global self-attention mechanism and the adaptive-mask local attention mechanism affect recognition accuracy under different topologies, an optimal fusion attention mechanism with a "local-global" cascade topology is proposed. Replacing the self-attention mechanism in the Transformer encoder with the proposed fusion attention mechanism yields the improved encoder network.

2. A decoder network based on hierarchical grouped linear transformations is proposed. Using grouped feed-forward networks of different sizes, a lightweight "expand-and-scale" unit based on hierarchical grouped linear transformations is constructed. With a block-by-block scaling strategy, each block of the Transformer decoder is embedded with an "expand-and-scale" unit under a different parameter configuration, yielding a decoder network whose depth and width increase progressively. Combining the Transformer encoder network with the "local-global" attention mechanism and the decoder network based on hierarchical grouped linear transformations gives an improved lightweight Transformer deep encoder-decoder network.

The improved Transformer encoder network proposed in this paper achieves a word error rate of 5.65% on the AISHELL-1 Mandarin Chinese dataset. The improved lightweight Transformer deep encoder-decoder network achieves error rates of 5.99% and 11.06% with 19.9M and 19.6M parameters on the AISHELL-1 dataset and the TED-LIUM 2 English dataset, respectively, outperforming the other methods compared.
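The adaptive-mask local attention in contribution 1 can be sketched in NumPy. This is a minimal illustration, assuming the learnable mask is a soft sigmoid window over token distance whose width `w` is trainable; the thesis's exact parametric mask function may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_mask_attention(q, k, v, w, tau=1.0):
    """Local attention with a differentiable window mask.

    The hard locality constraint |i - j| <= w is relaxed to
    sigmoid((w - |i - j|) / tau), so the effective window width w
    can be learned by gradient descent. Illustrative parameterization,
    not the thesis's exact mask function.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                       # (T, T) scaled dot products
    dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    mask = 1.0 / (1.0 + np.exp(-(w - dist) / tau))      # soft locality mask in (0, 1)
    scores = scores + np.log(mask + 1e-9)               # down-weight distant positions
    return softmax(scores) @ v                          # (T, d) local features
```

With a small `w` the attention weights concentrate near the diagonal; as `w` grows, the mask approaches all-ones and the layer recovers ordinary global attention.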
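The "local-global" cascade topology can likewise be sketched: a local windowed attention pass extracts short-range features, whose output is then fed to a global self-attention pass. This sketch omits linear projections and multi-head splitting (the input serves as query, key, and value), and the hard window here stands in for the adaptive mask.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention with an optional boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block positions outside the mask
    return softmax(scores) @ v

def local_global_cascade(x, window=2):
    """Cascade fusion: local windowed attention first, then global attention.

    Illustrative of the "local-global" cascade topology only; projections,
    multi-head splitting, and residual connections are omitted for brevity.
    """
    T = x.shape[0]
    dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    local_out = attention(x, x, x, mask=dist <= window)  # short-term local features
    return attention(local_out, local_out, local_out)    # global context on top
```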
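The parameter saving behind the grouped "expand-and-scale" unit in contribution 2 is easy to see in a small sketch: replacing each dense linear map in an expand-then-project feed-forward block with a block-diagonal map over `g` feature groups cuts its parameter count by a factor of `g`. The dimensions below are toy values, not the thesis's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, g, expand = 8, 4, 2  # toy model width, number of groups, expansion factor

# Dense "expand-and-scale" FFN (d -> expand*d -> d) parameter count.
dense_params = d * (expand * d) + (expand * d) * d

def grouped_linear(x, weights):
    """Apply one small weight matrix per feature group (block-diagonal map)."""
    groups = np.split(x, len(weights), axis=-1)
    return np.concatenate([gi @ w for gi, w in zip(groups, weights)], axis=-1)

# Grouped version: each of the g groups gets its own (d/g x expand*d/g) matrix.
W1 = [rng.standard_normal((d // g, expand * d // g)) for _ in range(g)]
W2 = [rng.standard_normal((expand * d // g, d // g)) for _ in range(g)]
grouped_params = sum(w.size for w in W1 + W2)

x = rng.standard_normal((3, d))
h = np.maximum(grouped_linear(x, W1), 0.0)  # expansion step with ReLU
y = grouped_linear(h, W2)                   # scaling back to width d
```

Here `grouped_params` is exactly `dense_params // g`, which is what makes embedding such units in every decoder block (under progressively larger configurations, per the block-by-block scaling strategy) affordable.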