
Acoustic Model Training Based On Data Noise And Text-speech Alignment

Posted on: 2020-07-19    Degree: Master    Type: Thesis
Country: China    Candidate: Y P Qin    Full Text: PDF
GTID: 2417330578973086    Subject: Statistical machine learning
Abstract/Summary:
In the AI era, speech has become increasingly popular for human-computer interaction because it is more efficient than text. Speech recognition is the process of converting a speech signal into text output. The acoustic model is the core module of a speech recognition system, and its main training method is the deep neural network hidden Markov model (DNN-HMM). Training an acoustic model means adjusting model parameters from input speech data and its annotated text; the more abundant the speech data and the more accurate the annotations, the stronger the generalization ability of the trained model. At present, however, the following problems exist: (1) Training data is scarce and difficult to obtain. (2) Because of typing errors or misunderstanding, manually transcribed annotations inevitably contain missing words, extra words, and wrong characters, which makes the annotated text inaccurate; using such data for acoustic training reduces the generalization ability of the model. (3) In noisy backgrounds, the recognition rate of the acoustic model is often very low, because its training audio is usually clean speech recorded in quiet environments.

To address these three shortcomings, this paper carries out the following work: (1) Borrowing the idea of data augmentation by noise injection from image recognition, this paper selects four kinds of noise (airport, automobile, street, and train) and uses a Python program to mix them into the audio data. Adding noise both alleviates the problem of acquiring training data and allows the acoustic model to be trained on a simulated noisy background, improving the robustness of the system in noisy environments. (2) To handle incorrect annotated text, this paper proposes a forward-backward algorithm based on traditional forced alignment (alignment of speech with annotated text, a form of speech-text alignment). First, the speech is recognized and the results are stored as a lattice; then the recognition results and the annotated text are validated against each other using posterior probabilities. The annotated text and corresponding audio with a high error rate are discarded, and the remaining data is used for model training. We call this speech-text alignment method lattice (word-graph) alignment. (3) Depending on whether the audio data is noise-augmented and whether the annotated text is aligned with the word graph, experiments on noise augmentation and alignment are carried out. The experimental results show that using the two methods separately or in combination improves the recognition rate of the speech recognition system to some extent, which demonstrates that the methods proposed in this paper are indeed feasible.
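The abstract states that noise is mixed into the training audio with a Python program but gives no further detail. The following is a minimal sketch of one common way to do this, mixing a noise recording into clean speech at a target signal-to-noise ratio; the `soundfile` dependency, the file names, and the 10 dB SNR are illustrative assumptions, not the thesis author's actual script.

```python
# Minimal sketch: mix a noise recording into clean speech at a target SNR.
# Assumes the `soundfile` package and single-channel WAV files sharing the
# same sample rate; file names and the 10 dB SNR are illustrative only.
import numpy as np
import soundfile as sf

def add_noise(clean_path, noise_path, out_path, snr_db=10.0):
    clean, sr = sf.read(clean_path)
    noise, _ = sf.read(noise_path)

    # Loop the noise if it is shorter than the speech, then truncate.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]

    # Scale the noise so that the mixture reaches the requested SNR.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / (noise_power + 1e-12))

    sf.write(out_path, clean + noise, sr)

# Example: simulate a street-noise background for one training utterance.
add_noise("utt_001.wav", "street.wav", "utt_001_street_10db.wav", snr_db=10.0)
```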
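The thesis validates annotated text against lattice-based recognition results and discards high-error utterances before training. The sketch below only approximates that idea with a plain word-error-rate check: `decode()` is a hypothetical placeholder for an existing lattice-producing recognizer, and the 0.5 threshold is an arbitrary illustrative value, not the thesis's posterior-probability criterion.

```python
# Minimal sketch of the data-filtering idea: decode each utterance, compare
# the hypothesis with the annotated transcript, and discard utterances whose
# error rate is too high.

def edit_distance(ref, hyp):
    # Standard Levenshtein distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)]

def filter_corpus(corpus, decode, max_wer=0.5):
    """Keep only (audio, transcript) pairs whose word error rate is acceptable."""
    kept = []
    for audio, transcript in corpus:
        hyp = decode(audio)                 # recognizer output as a word list
        ref = transcript.split()
        wer = edit_distance(ref, hyp) / max(len(ref), 1)
        if wer <= max_wer:
            kept.append((audio, transcript))
    return kept
```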
Keywords/Search Tags: Speech recognition, Hidden Markov Model, Acoustic model, Data noise, Viterbi forced alignment, Lattice, Deep Neural Network