
Speech Emotion Recognition Based On Graph Convolution Neural Network

Posted on: 2024-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: F F Xu
Full Text: PDF
GTID: 2558307136992949
Subject: Electronic information
Abstract/Summary:
Speech Emotion Recognition (SER) is a technology that studies the emotional information carried in human speech: it identifies the emotional state of a speaker by analyzing acoustic features and emotional cues in the speech signal. Previous research has mainly relied on traditional machine learning methods, convolutional neural networks (CNNs), and their variants. However, the diversity of emotional expression, the variability of emotion categories, and the differing lengths of speech samples make it difficult for manually designed features to cover most of the emotional information in a sample. In recent years, with the rise of graph-based models, Graph Convolutional Networks (GCNs) have shown excellent performance in deep learning. This thesis therefore first studies deep feature extraction from speech samples, then introduces the GCN to handle the speech emotion recognition task, and finally improves the adjacency matrix. The main research work is as follows:

(1) To extract emotional features that cover more of the emotional information in speech samples, and to capture their temporal information, this thesis proposes a speech emotion recognition method based on Bidirectional Long Short-Term Memory and Graph Convolutional Networks (BLSTM-GCN). The method first uses the openSMILE toolkit to extract frame-level speech emotion features. A Bidirectional Long Short-Term Memory (Bi-LSTM) network then extracts deeper frame-level emotion features, which are split into two paths for the subsequent networks. Next, the frame-level deep emotion feature vectors are assembled into a graph structure and trained with a GCN, and each speech sample is modeled globally using sum pooling. Finally, a softmax function performs prediction and classification. By extracting emotional features at different levels, the method enhances the feature representation; by introducing the GCN as the baseline network for speech emotion recognition in place of common CNNs and their variants, it can effectively optimize node features using the topological structure among them. Experimental results show that this method achieves weighted accuracies of 66.04% and 57.5% on the IEMOCAP and MSP-IMPROV databases, respectively.

(2) To address the limited node-information interaction of predefined adjacency matrices in GCNs, this thesis proposes a decayed-connection adjacency matrix and an adaptive adjacency matrix. The decayed-connection method adds the predefined adjacency matrix element-wise to a relation matrix and then applies a decay hyperparameter to the resulting matrix. The adaptive method first randomly initializes a learnable node-embedding matrix, then infers the spatial dependencies between node pairs from the similarity of their embeddings; during training, the loss computed from the objective function is used by the optimizer to update the node-embedding matrix, which gradually moves from its random initial value toward the optimum. Compared with a predefined adjacency matrix, both proposed matrices enhance the information interaction between nodes. Experimental results show that the proposed methods achieve weighted accuracies of 66.82% and 58.35% on the IEMOCAP and MSP-IMPROV databases, respectively.
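As a rough illustration of the GCN stage described in (1), the sketch below builds a chain graph over frame-level feature vectors, applies one graph-convolution layer, sum-pools the node features into an utterance-level vector, and classifies it with softmax. The dimensions, the chain-graph adjacency, and the single-layer network are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

def normalize_adj(A):
    # Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def gcn_layer(A_norm, H, W):
    # One graph-convolution layer: aggregate neighbor features, project, ReLU
    return np.maximum(A_norm @ H @ W, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

T, F, HID, C = 5, 8, 16, 4  # frames, feature dim, hidden dim, emotion classes
rng = np.random.default_rng(0)
X = rng.standard_normal((T, F))  # deep frame-level features (e.g. Bi-LSTM output)

# Chain graph: each frame node connected to its temporal neighbors
A = np.diag(np.ones(T - 1), 1) + np.diag(np.ones(T - 1), -1)
A_norm = normalize_adj(A)

W1 = rng.standard_normal((F, HID)) * 0.1
W_out = rng.standard_normal((HID, C)) * 0.1

H1 = gcn_layer(A_norm, X, W1)   # node-level emotion features
g = H1.sum(axis=0)              # sum pooling -> utterance-level vector
probs = softmax(g @ W_out)      # emotion-class probabilities
```

In a trained model, `W1` and `W_out` would be learned and the graph would cover a whole utterance, but the propagation rule is the standard GCN form shown here.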
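The two adjacency-matrix variants in (2) can be sketched as follows. The abstract does not give the exact formulas, so the similarity function used here (row-softmax over ReLU of the embedding Gram matrix, a common formulation for adaptive adjacency) and the decayed-connection form are hedged assumptions for illustration only.

```python
import numpy as np

def row_softmax(M):
    # Normalize each row into a probability distribution
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def adaptive_adjacency(E):
    # Infer pairwise dependencies from node-embedding similarity:
    # ReLU thresholds negative scores, row-softmax normalizes.
    # (Assumed form; the thesis may use a different similarity.)
    return row_softmax(np.maximum(E @ E.T, 0.0))

def decayed_adjacency(A_pre, R, alpha):
    # Element-wise sum of the predefined adjacency A_pre and a
    # relation matrix R, scaled by a decay hyperparameter alpha
    # (illustrative reading of the decayed-connection method).
    return alpha * (A_pre + R)

N, D = 5, 3
rng = np.random.default_rng(1)
# Learnable node-embedding matrix, randomly initialized; during training
# an optimizer would update E from the task loss.
E = rng.standard_normal((N, D))
A_adapt = adaptive_adjacency(E)  # each row sums to 1
```

Because `E` is updated by gradient descent, `A_adapt` is re-derived every step, letting the learned graph structure approach the optimum from its random initialization, as the abstract describes.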
Keywords/Search Tags:Speech Emotion Recognition, Deep Learning, Feature Extraction, Graph Convolutional Neural Networks, Adjacency Matrix