| Chinese is one of the most widely used languages in the world, and its large vocabulary, distinctive tone system, and special rhyme structure make Chinese speech recognition a challenging research problem. With its continual advancement, speech recognition technology has become a vital part of daily life and is increasingly widespread, particularly in human-computer interaction, intelligent customer service, and related areas. The rapid progress of artificial intelligence has made intelligent devices a necessity of everyday life, and human-computer interaction technology plays a central role in them. Voice interaction, which replaces traditional control methods such as touch, gestures, and keyboards, is now one of the most essential human-computer interaction technologies. Speech recognition is the first step in human-computer speech communication; it has made tremendous strides, but it still faces difficulties of its own. This work addresses two of them. (1) Because traditional acoustic models are complex, cannot be trained end to end, and require pre-aligned data, we optimize the CNN + BiLSTM + CTC speech recognition framework by introducing the gated recurrent unit (GRU) and the bidirectional gated recurrent unit (BiGRU) into the original network structure. The model uses a one-dimensional CNN combined with context information for feature extraction, which improves feature representation, and uses CTC to realize end-to-end speech recognition. We also examine how different audio features, used as system inputs, affect system performance. The GRU is a variant of the recurrent neural network, and the BiGRU captures features in both directions, which helps extract the implicit information in temporal data. Combining the two greatly improves the feature extraction ability. Experimental results on the AISHELL-1 dataset show a significant improvement in speech recognition performance compared to all baseline models, with a character error rate of 13.8% on the test set. (2) To address the low recognition rate, large number of parameters, and long training time of the training task, we propose a CNN-BiGRU speech recognition algorithm that integrates an attention mechanism. The method first normalizes and preprocesses the dataset and converts each audio signal into a spectrogram, which serves as the model input. Second, a deep learning model is built that combines a convolutional neural network (CNN) with a BiGRU: the CNN captures local features, and the BiGRU extracts temporal features. The local and temporal features are then passed to the attention layer, which focuses on information relevant to the speech features and suppresses irrelevant information, and the result is output through a fully connected layer. Finally, the model is evaluated with several metrics and compared with other baseline models on the open-source AISHELL-1 dataset. The attention-based CNN-BiGRU algorithm achieves good results and improves the performance of the speech recognition system. |
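
The abstract reports results as a character error rate (CER) obtained from a CTC-trained model. As a minimal sketch of how such a figure is typically computed, the Python below collapses frame-level CTC labels into a character sequence and scores it against a reference with the Levenshtein distance. The abstract does not state which decoding strategy was used, so greedy (best-path) decoding is an assumption here, as are the blank index `BLANK = 0` and the function names `ctc_greedy_decode`, `edit_distance`, and `character_error_rate`.

```python
from typing import List, Optional, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (not given in the abstract)


def ctc_greedy_decode(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Best-path CTC rule: merge consecutive repeats, then drop blanks."""
    decoded: List[int] = []
    previous: Optional[int] = None
    for label in frame_ids:
        # Keep a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences, using one rolling row."""
    row = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        diagonal, row[0] = row[0], i
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = diagonal + (ref_item != hyp_item)
            diagonal = row[j]  # becomes the diagonal cell for column j + 1
            row[j] = min(row[j] + 1,      # deletion from the reference
                         row[j - 1] + 1,  # insertion into the reference
                         substitution)
    return row[len(hyp)]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical frame-level argmax output; 0 is the blank.
    print(ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0]))
    # One substituted and one deleted character in a six-character reference.
    print(character_error_rate("今天天气很好", "今天天汽好"))
```

The blank between the two runs of label `1` is what lets CTC emit the same character twice in a row, which is why repeats are merged before blanks are removed. A corpus-level CER such as the one quoted for the AISHELL-1 test set is usually the total edit distance divided by the total number of reference characters, not the mean of per-utterance rates.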