In recent years, with the rapid development of artificial intelligence, large language model technology represented by ChatGPT has opened the door to artificial general intelligence. GPT-4's breakthrough in instruction understanding, in particular, has made the recognition and expression of human emotions a research hotspot in artificial intelligence. Speech emotion recognition in human-computer interaction has thus become increasingly important and has broad prospects for commercial application. Building on the current state of speech emotion recognition research at home and abroad, this thesis improves recognition performance from three directions: feature extraction, neural network design, and attention mechanisms. The main research content is as follows:

(1) Traditional hand-crafted features are high-dimensional and require extensive experiments to select a feature set; they reflect only time-domain or frequency-domain characteristics, cannot relate time and frequency information, and lack the discriminative power needed to describe subjective emotions. This thesis extracts the Log-Mel spectrogram and its Delta and Delta-Delta coefficients from the speech signal to form a 3-D Log-Mel feature. This feature accurately characterizes the time-frequency properties of the speech signal and shows how signal energy varies with frequency. It also markedly reduces interference from emotion-irrelevant factors such as speaking style, environment, and culture, thereby lowering the misclassification rate (a minimal extraction sketch follows this abstract).

(2) To address the high complexity and long training time of speech emotion recognition models, this thesis designs a lightweight network based on depthwise separable convolution and a skip-connected bidirectional gated recurrent unit (DSC-Skip-BiGRU). Depthwise separable convolution reduces the parameter count and speeds up training while extracting spatial features from the 3-D Log-Mel input, and the skip-connected BiGRU models the temporal dynamics of the high-level spatial features, focusing on the contextual information of emotion. Compared with the baseline model, the network reduces the number of parameters by nearly 50% and the training time by nearly 30% while achieving almost the same recognition rate, effectively resolving the problems of high model complexity and long training time (an architectural sketch is given below).

(3) To address the low recognition rate caused by non-emotional feature information, this thesis introduces attention mechanisms that focus on different time or frequency regions of the speech signal in order to better capture feature information related to emotional states. By changing the model's input format, the model can learn the emotional features in speech signals more comprehensively. We design an attention-based DSC-Skip-BiGRU network and apply the self-attention and multi-head attention mechanisms respectively. Experimental results show that with multi-head attention the model performs best with 8 heads, achieving a weighted average recognition rate of 91.18% and an unweighted average recognition rate of 91.02% on the EMO-DB dataset, and 79.81% and 79.73%, respectively, on the IEMOCAP dataset (an attention sketch is given below).
To address the model complexity caused by an excessive number of heads in multi-head attention, this thesis replaces dot-product attention with a locality-sensitive hashing (LSH) dot-product attention mechanism, effectively reducing model complexity: the training speed remains nearly stable as the sequence length increases (a simplified LSH sketch is given below).

(4) A speech emotion recognition system is designed and developed on the Qt framework. The system supports recording, audio loading, preprocessing, feature extraction, and emotion recognition. The DSC-Skip-BiGRU-LSH-Attention model with the best recognition performance is deployed in the system for experiments. The results show that the model retains high accuracy when moving from theoretical research to practice, and they verify the effectiveness of the speech emotion recognition system developed in this thesis (a hypothetical inference pipeline is sketched at the end).
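The sketches below illustrate the components described above. First, a minimal sketch of the 3-D Log-Mel extraction in point (1), assuming librosa is available; the sampling rate, FFT size, hop length, and Mel-band count are illustrative defaults, not the thesis settings.

```python
import numpy as np
import librosa

def extract_3d_log_mel(path, sr=16000, n_mels=64):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                # static Log-Mel spectrogram
    delta = librosa.feature.delta(log_mel)            # first-order difference (Delta)
    delta2 = librosa.feature.delta(log_mel, order=2)  # second-order difference (Delta-Delta)
    # Stack the three channels into one 3-D input of shape (3, n_mels, frames)
    return np.stack([log_mel, delta, delta2], axis=0)
```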
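Next, a minimal PyTorch sketch of the DSC-Skip-BiGRU idea from point (2): depthwise separable convolutions extract spatial features, and a BiGRU with a skip connection models time. All layer sizes and the four-class output are assumptions for illustration, not the thesis configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Depthwise: one filter per channel; pointwise: 1x1 mixing across channels
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return torch.relu(self.pointwise(self.depthwise(x)))

class DSCSkipBiGRU(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            DepthwiseSeparableConv(3, 32), nn.MaxPool2d(2),
            DepthwiseSeparableConv(32, 64), nn.MaxPool2d(2),
        )
        feat = 64 * (n_mels // 4)                  # channels x pooled Mel bands
        self.gru = nn.GRU(feat, hidden, batch_first=True, bidirectional=True)
        self.skip = nn.Linear(feat, 2 * hidden)    # projects input for the skip connection
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                          # x: (batch, 3, n_mels, frames)
        h = self.conv(x)                           # (batch, 64, n_mels//4, frames//4)
        h = h.permute(0, 3, 1, 2).flatten(2)       # (batch, time, features)
        out, _ = self.gru(h)
        out = out + self.skip(h)                   # skip connection around the BiGRU
        return self.fc(out.mean(dim=1))            # average over time, then classify
```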
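Point (3)'s best setting uses multi-head self-attention with 8 heads. A short sketch over the BiGRU outputs, using PyTorch's built-in nn.MultiheadAttention; the feature size follows the model sketch above.

```python
import torch
import torch.nn as nn

# Self-attention over BiGRU outputs; embed_dim matches 2 * hidden above.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
seq = torch.randn(2, 50, 256)   # (batch, time, features), dummy input
ctx, _ = attn(seq, seq, seq)    # query = key = value -> self-attention
pooled = ctx.mean(dim=1)        # time-averaged, emotion-salient summary
```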
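Also for point (3), a deliberately naive illustration of LSH dot-product attention in the spirit of Reformer-style hashing: tokens are bucketed by random-hyperplane hash codes and attend only within their bucket, which is what keeps the cost manageable for long sequences. This is a sketch of the idea, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def lsh_attention(q, k, v, n_bits=4):
    # q, k, v: (seq, dim); shared random hyperplanes give angular LSH codes
    planes = torch.randn(q.shape[-1], n_bits)
    weights = 2 ** torch.arange(n_bits)                # pack bits into bucket ids
    bq = ((q @ planes > 0).long() * weights).sum(-1)
    bk = ((k @ planes > 0).long() * weights).sum(-1)
    out = torch.zeros_like(v)
    scale = q.shape[-1] ** 0.5
    for b in bq.unique():
        qi = (bq == b).nonzero(as_tuple=True)[0]       # queries in this bucket
        ki = (bk == b).nonzero(as_tuple=True)[0]       # keys in the same bucket
        if len(ki) == 0:
            continue                                   # naive: empty buckets stay zero
        scores = q[qi] @ k[ki].T / scale               # dot-product attention inside bucket only
        out[qi] = F.softmax(scores, dim=-1) @ v[ki]
    return out

out = lsh_attention(torch.randn(100, 64), torch.randn(100, 64), torch.randn(100, 64))
```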
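Last, a hypothetical end-to-end helper mirroring the Qt system's pipeline in point (4) (load, extract features, recognize), reusing the sketches above; the label set, file name, and wiring are placeholders, not the thesis artifacts.

```python
import torch

EMOTIONS = ["angry", "happy", "neutral", "sad"]       # assumed label set, not the thesis classes

def recognize(wav_path, model):
    feats = extract_3d_log_mel(wav_path)              # 3-D Log-Mel, from the first sketch
    x = torch.from_numpy(feats).float().unsqueeze(0)  # add a batch dimension
    with torch.no_grad():
        logits = model(x)                             # DSC-Skip-BiGRU sketch above
    return EMOTIONS[logits.argmax(dim=1).item()]

model = DSCSkipBiGRU(n_classes=len(EMOTIONS)).eval()
print(recognize("sample.wav", model))                 # "sample.wav" is a placeholder
```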