
Research on Speaker-Independent Speech Emotion Recognition Based on Neural Networks

Posted on: 2021-01-30
Degree: Master
Type: Thesis
Country: China
Candidate: R Mao
Full Text: PDF
GTID: 2428330614458518
Subject: Control Science and Engineering
Abstract/Summary:
The speech signal carries not only linguistic content but also rich emotional states. Studying the emotional features contained in speech and understanding the emotional elements it expresses is therefore of great application value for more efficient and responsive human-computer interaction. To improve the recognition rate of speaker-independent speech emotion recognition, this thesis starts from the extraction of feature parameters and the selection of models: first, it selects time-domain and frequency-domain features that effectively express speech emotion; second, it improves the recognition model; both are then applied to speaker-independent speech emotion recognition. The main research contents and innovations of this thesis are as follows:

1. A survey of speech emotion recognition theory. The basic principles of speech emotion recognition and the related theoretical knowledge are analyzed, providing a theoretical basis for the study of speech emotion recognition in this thesis.

2. Fusion of speech emotion features. MFCC features are extracted directly, Mel-spectrum features are extracted with a DCNN, and multiple kernel learning fuses the two into a new feature; the resulting kernel function is then used for SVM classification. Experiments on the EMO-DB and CASIA corpora show average recognition rates of 90.14% and 91.5%, respectively. Fusing multiple features yields higher classification accuracy than a classifier using any single feature, and compared with other speech emotion algorithms the improvements are 4.85% and 3.14%, so the proposed method effectively improves the recognition rate.

3. An improved DCNN-BiGRU-self-attention model. Using a DCNN to extract the Mel spectrum better captures the features of the speech emotion representation space. GRU is a variant of LSTM, and the BiGRU network combines the advantages of the bidirectional recurrent neural network and the Long Short-Term Memory network: it can learn the temporal context of speech emotion sequence data. To deal with the situation where the output error of a plain RNN slowly vanishes and its memory fades, GRU neurons are used in place of RNN units; in addition, GRU requires fewer tensor computations, so it trains faster than LSTM. Meanwhile, the self-attention mechanism lets the network model judge how much each speech frame contributes to the emotion and weight the emotional information in each frame according to that contribution. Experiments on the EMO-DB and CASIA corpora achieve average recognition rates of 89.53% and 91.74%, respectively; compared with other RNN-based models, the improvements are 9.49%, 4.09%, and 0.87%, which demonstrates the feasibility of the model.
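The feature-fusion idea in point 2 can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses random toy feature matrices standing in for the MFCC and DCNN Mel-spectrum views, fuses their RBF kernels with fixed weights (actual multiple kernel learning would learn those weights), and substitutes kernel ridge regression for the SVM to keep the example self-contained.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """RBF (Gaussian) kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
# Toy stand-ins for two feature views of the same utterances:
# X1 ~ "MFCC" features, X2 ~ "DCNN Mel-spectrum" features.
n = 40
y = np.repeat([-1.0, 1.0], n // 2)               # two emotion classes
X1 = rng.normal(0, 1, (n, 13)) + y[:, None]      # class-shifted view 1
X2 = rng.normal(0, 1, (n, 8)) + 0.5 * y[:, None] # class-shifted view 2

# Fuse the two kernels with fixed weights beta (MKL would learn these).
beta = np.array([0.6, 0.4])
K = beta[0] * rbf_kernel(X1, X1) + beta[1] * rbf_kernel(X2, X2)

# Kernel ridge classifier as an SVM stand-in: alpha = (K + lam*I)^-1 y,
# then predict with the sign of K @ alpha.
lam = 1e-2
alpha = np.linalg.solve(K + lam * np.eye(n), y)
pred = np.sign(K @ alpha)
train_acc = (pred == y).mean()
```

A convex combination of positive-definite kernels is itself a valid kernel, which is why the fused `K` can be handed directly to any kernel classifier, including the SVM used in the thesis.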
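The BiGRU and self-attention pooling described in point 3 can be sketched in a few dozen lines. This is an illustrative NumPy toy with random, untrained weights, not the thesis's trained DCNN-BiGRU model; the function names (`gru_step`, `self_attention_pool`) and dimensions are assumptions chosen for the example. It shows the GRU gate equations, the forward/backward pass whose states are concatenated, and how the attention weights assign each frame a proportion of the utterance-level emotion vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, P):
    """One GRU step: update gate z, reset gate r, candidate state h~."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h)
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h)
    h_tilde = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h))
    return (1 - z) * h + z * h_tilde

def init_params(d_in, d_h, rng):
    """Random GRU weights: W* act on inputs, U* on the hidden state."""
    s = 1.0 / np.sqrt(d_h)
    return {k: rng.uniform(-s, s, (d_h, d_in if k.startswith("W") else d_h))
            for k in ["Wz", "Wr", "Wh", "Uz", "Ur", "Uh"]}

def bigru(X, Pf, Pb, d_h):
    """Run forward and backward GRUs over frames X (T, d_in); concat states."""
    T = X.shape[0]
    hf, hb = np.zeros((T, d_h)), np.zeros((T, d_h))
    h = np.zeros(d_h)
    for t in range(T):                   # forward direction
        h = gru_step(X[t], h, Pf); hf[t] = h
    h = np.zeros(d_h)
    for t in reversed(range(T)):         # backward direction
        h = gru_step(X[t], h, Pb); hb[t] = h
    return np.concatenate([hf, hb], axis=1)   # (T, 2*d_h)

def self_attention_pool(H, W, v):
    """Score each frame, softmax over time, return the weighted sum."""
    e = np.tanh(H @ W.T) @ v                  # per-frame scores, shape (T,)
    a = np.exp(e - e.max()); a /= a.sum()     # attention weights sum to 1
    return a @ H, a                           # utterance vector, weights

rng = np.random.default_rng(1)
T, d_in, d_h, d_a = 20, 40, 16, 8    # frames, mel bins, hidden, attn sizes
X = rng.normal(0, 1, (T, d_in))      # stand-in for DCNN Mel-spectrum frames
H = bigru(X, init_params(d_in, d_h, rng), init_params(d_in, d_h, rng), d_h)
W = rng.normal(0, 0.1, (d_a, 2 * d_h))
v = rng.normal(0, 0.1, d_a)
utt_vec, attn = self_attention_pool(H, W, v)
```

Because `attn` is a softmax over frames, each frame's contribution to `utt_vec` is exactly the learned proportion the abstract describes; frames the scorer deems more emotional receive larger weights.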
Keywords/Search Tags:speech emotion recognition, feature fusion, BiGRU, self-attention mechanism