
Chinese Speech Recognition Method And Application Based On End-to-End Model

Posted on: 2024-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Shen
Full Text: PDF
GTID: 2568307127955039
Subject: Computer technology
Abstract/Summary:
The continuous advancement of technology has made speech recognition a vital interdisciplinary field that connects humans and machines, drawing on both computer science and computational linguistics. Since the 1960s, artificial neural networks have fueled speech recognition research, and the introduction of deep neural networks has further accelerated the development of end-to-end speech recognition technology. End-to-end speech recognition has many advantages over traditional approaches: because it does not require the separate modules of a traditional system (acoustic model, language model, and pronunciation model), it simplifies the recognition pipeline, eliminates many tedious steps, and improves accuracy and efficiency. It can also improve real-time performance, allowing the system to respond to user commands faster.

Although end-to-end models have achieved notable results in speech recognition, several defects remain, and this dissertation studies three of them. First, the end-to-end Transformer model lacks structural optimization specific to speech recognition tasks; moreover, because the Transformer focuses on global information, it neglects the capture of some local information, making it difficult for the model to obtain complete feature information. Second, most neural network models in speech recognition rely on huge parameter counts and model sizes to achieve high performance, which slows inference and makes deep networks difficult to deploy on lightweight devices. Third, some speech recognition models perform poorly in practical scenarios, differing greatly from their performance in the training phase.

First, to address the difficulty of optimizing the Transformer for speech recognition and its weakness in capturing local information, this dissertation proposes the LCN-Transformer. The LCN-Transformer adds a convolution module to the encoder, which strengthens the model's local feature extraction. It further improves recognition accuracy by repositioning the normalization layers in the encoder and decoder, and by changing the activation function of the feed-forward network. Experiments show that the proposed model outperforms the baseline, improving recognition accuracy by 23% and 33% on two different datasets; these results demonstrate the LCN-Transformer's generalization ability and effectiveness.

Second, to address the slow inference caused by deep layer stacking and the slow feature extraction of large models, this dissertation proposes LMTransformer, a lightweight Transformer-based Chinese speech recognition model. It first uses depthwise separable convolution to extract audio feature information. It then constructs a double-half-step residual weight feed-forward network and introduces low-rank matrix decomposition to compress the model while preserving recognition accuracy. Finally, a sparse attention mechanism is used to speed up decoding. Experiments on different datasets show that the proposed model improves inference speed while maintaining high recognition accuracy, and exhibits a degree of generalization.

Finally, this dissertation addresses the limitations of speech recognition models that remain purely theoretical: poor performance in real-world environments, weak integration with other natural language technologies, and complex practical deployment. To overcome these limitations, we build on deep learning and speech recognition technology, integrate speech recognition with natural language processing, and develop a voice chat dialogue system that supports online real-time recording, speech recognition, and everyday user chat.

In conclusion, this dissertation optimizes the Transformer's structure for the speech recognition task, improving recognition accuracy; proposes a lightweight deep neural network speech recognition model that overcomes the large size and slow inference of current deep models; and develops an effective voice chat dialogue system based on the proposed methods.
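The normalization-layer repositioning in the LCN-Transformer corresponds to the general pre-norm versus post-norm placement of layer normalization around a residual sub-layer. The following minimal NumPy sketch illustrates that general technique only, not the dissertation's actual code; the sub-layer here is a toy ReLU stand-in:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original Transformer arrangement: normalize after the residual sum
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm arrangement: normalize the sub-layer input, leaving the
    # residual path an identity, which often eases training of deep stacks
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # (sequence length, model dim)
ffn = lambda h: np.maximum(h, 0.0)     # toy sub-layer in place of the real FFN

y_post = post_norm_block(x, ffn)
y_pre = pre_norm_block(x, ffn)
```

Note that only the post-norm output is normalized per position; the pre-norm output keeps an un-normalized residual stream, which is the property usually credited with stabilizing very deep encoders.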
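The lightness of the depthwise separable convolution used in LMTransformer for feature extraction follows from a simple parameter-count argument: a standard convolution needs c_in x c_out x k x k weights, while splitting it into a depthwise stage (one k x k filter per input channel) plus a 1x1 pointwise stage needs only c_in x k x k + c_in x c_out. A small sketch with illustrative channel sizes (not the dissertation's configuration):

```python
def conv2d_params(c_in, c_out, k):
    # Standard convolution: every output channel mixes all input channels
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # Depthwise stage (one k x k filter per input channel)
    # plus pointwise 1x1 stage that mixes channels
    return c_in * k * k + c_in * c_out

std = conv2d_params(64, 128, 3)               # 64*128*9  = 73728
sep = depthwise_separable_params(64, 128, 3)  # 576 + 8192 = 8768
```

With these example sizes the separable form uses roughly 8x fewer parameters, which is the kind of saving that makes deployment on lightweight devices plausible.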
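Low-rank matrix decomposition, the compression step in LMTransformer, replaces a weight matrix W (m x n) with two factors U (m x r) and V (r x n), cutting the parameter count from m*n to r*(m+n) when r is small. The sketch below is a generic truncated-SVD illustration of this idea on a synthetic weight matrix, not the dissertation's implementation:

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) by U @ V with U (m x rank), V (rank x n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

rng = np.random.default_rng(0)
# A synthetic weight matrix that is genuinely rank 8
W = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 1024))

U_r, V_r = low_rank_factorize(W, rank=8)
orig_params = W.size               # 256 * 1024 = 262144
comp_params = U_r.size + V_r.size  # 8 * (256 + 1024) = 10240
err = np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W)
```

Because the synthetic matrix is exactly rank 8, the reconstruction error here is at floating-point level; for real trained weights the chosen rank trades accuracy against compression.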
Keywords/Search Tags:Speech Recognition, End-to-End Model, Transformer, Lightweight Method, Speech Dialogue System