| Language is the foundation of human communication, building bridges for people's study, work, and life, and speech recognition technology is the way machines come to understand human language. To better understand how their employees' work is progressing, many companies want employees to report on and summarize their work regularly. If speech recognition technology can convert these spoken reports into text, they become easy to store and archive, and the recognized text can be read when the audio cannot be played. A speech recognition system consists of an acoustic model and a language model. The acoustic model takes speech as input and converts it into a pinyin sequence, and the language model converts that pinyin sequence into text. The more features these two models learn, the more accurate the recognition result. On this basis, this paper builds a baseline model with DFCNN-CTC as the acoustic model and an N-gram as the language model. However, the N-gram model does not capture the required semantic connections and has too many training parameters, so the language model needs to be improved. This paper also studies the characteristics of work-report speech. First, Internet workers often mix Chinese and English in the same sentence when reporting on their work. Second, the input may be a long utterance or several sentences spoken continuously, and the recognition result is a string of Chinese characters without spaces or punctuation marks, which is hard for users to read. To address these problems, this paper introduces a Transformer language model to improve the baseline system. Its core component, the self-attention mechanism, takes into account the influence of every word in the sentence on the current word, which establishes semantic correlation between words and makes up for the shortcomings of the N-gram model. In addition, to study mixed Chinese-English speech
recognition, this paper constructs a Chinese-English mixed speech dataset, Chi-Eng, and compares the two models on three datasets. The results show that the proposed model outperforms the baseline model in recognition accuracy and performance, and that it can successfully recognize mixed Chinese-English speech. Second, this paper introduces a voice activity detection algorithm into the model to handle long speech input and the lack of breakpoints in the recognized string of Chinese characters. Voice activity detection can find the breaks in a continuous speech signal. Using this property, a long speech signal is cut into short speech segments, on which speech recognition is then performed. The breakpoint information is also used to determine pause positions and insert punctuation marks. The experimental results show that this approach preserves recognition accuracy while making the recognition results more readable. Finally, based on the research in this paper, a work-report-oriented speech recognition system is designed and implemented on top of an enterprise’s work report system. |
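
The segmentation and punctuation step described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a simple short-time-energy detector, and the function names, frame length, energy threshold, minimum pause length, and the rule of joining segments with "，" and ending with "。" are all hypothetical choices made for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def split_on_silence(samples, frame_len=4, threshold=0.01, min_silence_frames=2):
    """Return (start, end) sample ranges of speech separated by long pauses.

    Assumption: a frame counts as speech when its energy reaches `threshold`,
    and a pause of at least `min_silence_frames` frames is a breakpoint.
    """
    energies = frame_energies(samples, frame_len)
    segments = []
    seg_start = None   # index of the first frame of the current segment
    silence_run = 0    # consecutive silent frames seen inside the segment
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if seg_start is None:
                seg_start = i
            silence_run = 0
        elif seg_start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # Close the segment right after its last voiced frame.
                end_frame = i - silence_run + 1
                segments.append((seg_start * frame_len, end_frame * frame_len))
                seg_start = None
                silence_run = 0
    if seg_start is not None:
        # Signal ended inside a segment; drop any trailing silent frames.
        end_frame = len(energies) - silence_run
        segments.append((seg_start * frame_len, end_frame * frame_len))
    return segments


def join_with_punctuation(segment_texts, pause="，", stop="。"):
    """Join per-segment recognition results, marking each breakpoint.

    Assumption: every internal breakpoint becomes a comma and the end of
    the input becomes a full stop; the paper may use a different rule.
    """
    parts = [text for text in segment_texts if text]
    return pause.join(parts) + stop if parts else ""


# Toy signal: 8 voiced samples, 8 silent, 4 voiced, 4 silent.
signal = [0.5] * 8 + [0.0] * 8 + [0.5] * 4 + [0.0] * 4
segments = split_on_silence(signal)
```

In a real pipeline each `(start, end)` range would be passed to the recognizer, and the per-segment texts would then be handed to `join_with_punctuation`. A production system would more likely use a trained or adaptive detector than a fixed energy threshold.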