
Research On Speech Emotion Recognition Based On The Hierarchical Fusion Of Long Short-Term Memory Networks

Posted on: 2022-12-05    Degree: Master    Type: Thesis
Country: China    Candidate: X Y Zhang    Full Text: PDF
GTID: 2480306761464404    Subject: Telecom Technology
Abstract/Summary:
Speech Emotion Recognition (SER) is an emotion recognition method based on natural human speech and a key way to identify individual emotions in everyday talk. SER uses the acoustic features of a speech fragment rather than the lexical features that carry its semantic content, so it identifies a subject's emotion from the "way" they speak rather than from what they say. In intelligent human-computer interaction and related services, predicting the target speaker's emotional state can be an important factor in decision making; it is the key to computers understanding human emotions and a prerequisite for natural human-computer interaction.

In the field of speech emotion recognition, many features can express speech emotion. If the distinct advantages of different speech emotion models are combined and their features fused, recognition performance can be improved effectively. In practice, however, traditional SER simply combines two speech emotion feature sets in series or in parallel. This directly increases the dimensionality of the fused feature, places an excessive computational burden on the whole recognition process, and thus invisibly raises the space and time complexity of the recognition system.

Later, deep learning methods proved able to learn nonlinear representations of speech signals at different input levels and have been widely applied in voiceprint recognition, speech recognition, emotion recognition, and other fields. Deep neural networks, convolutional neural networks, and recurrent neural networks are commonly used in speech emotion recognition. However, deep-learning-based methods cannot fully mine local feature information and ignore the contextual coherence of global features. Therefore, to highlight the signal characteristics of different tasks and address the above problems in SER, this paper proposes a speech emotion recognition method based on the hierarchical fusion of
long short-term memory networks. The specific innovations are as follows:

1) Four dual-channel ConvLSTM blocks are designed to extract local emotion features with hierarchical correlation. The ConvLSTM layer handles input-to-state and state-to-state transitions, and its convolution operations extract spatial cues. ConvLSTM focuses on the key elements of speech fragments that make sequential speech signals easy to identify, ensuring the predictive performance of the speech emotion recognition framework. Residual learning strategies are used to extract temporal and spatial cues from the hierarchical speech signals.

2) A novel sequence learning strategy is used to extract global information, and the gated recurrent unit (GRU) is improved to adaptively adjust the relevant global feature weights according to the correlation of the input features. The output of a three-layer bidirectional GRU model is fed into an attention mechanism to obtain salient features, and a fully connected layer then produces the judgment score for each emotion.

3) Finally, a center loss function is used together with the softmax loss to produce the probabilistic classification. The improved center loss strengthens the final classification result, ensures prediction accuracy, and plays a significant role in the whole speech emotion recognition scheme.

The proposed method is tested on two standard interactive emotional speech and song audiovisual databases and on the Common Voice Chinese speech dataset, and the results show that the proposed method is effective.
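The dimensionality growth of the simple serial fusion criticized above can be illustrated with a toy example. The feature names and sizes here are hypothetical (the thesis does not specify them); the point is only that plain concatenation makes the fused vector's dimension the sum of its parts.

```python
import numpy as np

# Hypothetical per-utterance feature vectors for illustration only:
mfcc = np.random.randn(39)     # e.g. 13 MFCCs plus delta and delta-delta stats
prosody = np.random.randn(32)  # e.g. pitch/energy statistics

# Simple serial (concatenation) fusion: every added feature set
# grows the fused dimension, and with it the downstream cost.
fused = np.concatenate([mfcc, prosody])
print(fused.shape)  # (71,)
```

This is why the thesis argues for a learned hierarchical fusion instead of raw concatenation.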
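The attention pooling in innovation 2) can be sketched as follows. This is a minimal stand-in, not the thesis's implementation: the hidden states would come from the three-layer bidirectional GRU, and the attention vector `w` would be learned, whereas both are random toy values here.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, H = 5, 8                        # toy sequence length and hidden size
hidden = rng.standard_normal((T, H))  # per-frame hidden states (stand-in for GRU output)
w = rng.standard_normal(H)            # attention parameter (hypothetical, would be learned)

scores = hidden @ w                # one relevance score per frame
alpha = softmax(scores)            # attention weights, sum to 1
utterance_vec = alpha @ hidden     # weighted sum -> fixed-size salient feature
print(utterance_vec.shape)         # (8,)
```

The fixed-size `utterance_vec` is what a fully connected layer would then score per emotion class.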
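The center loss in innovation 3) follows the standard form L_c = 1/2 · mean‖x_i − c_{y_i}‖², which penalizes embeddings far from their class center. The sketch below uses fixed toy centers and embeddings; in training, the centers are learned jointly with the network, and this term is added to the softmax loss.

```python
import numpy as np

def center_loss(features, labels, centers):
    """Standard center loss: 0.5 * mean squared distance to each sample's class center."""
    diffs = features - centers[labels]           # per-sample offset from its class center
    return 0.5 * np.mean(np.sum(diffs ** 2, axis=1))

# Toy batch: 4 embeddings of dimension 3, 2 emotion classes (values hypothetical).
feats = np.array([[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.8, 0.2]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])            # fixed here; learned in practice

print(center_loss(feats, labels, centers))       # 0.0125
```

Minimizing this term pulls same-emotion embeddings together, which is what tightens the final classification.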
Keywords/Search Tags: Speech emotion recognition, Convolutional long short-term memory network, Hierarchical correlation, Gated recurrent unit