Speech Emotion Recognition (SER) is one of the key technologies of human-computer communication, and the extraction of speech emotion features is an important basis for emotion discrimination. Among the available representations, the spectrogram displays the harmonics, formants, energy, and other information contained in speech as a two-dimensional image, making it one of the most effective sources of emotional features. However, many effective feature extraction models have high complexity, which hinders model training and later improvement. Reducing model complexity while effectively extracting the emotional features of the spectrogram is therefore of great significance to speech emotion recognition research.

Focusing on these two problems, this paper first proposes a low-complexity Dual Nested Residual Network 9 (DN-ResNet9) to extract fused emotion features. Second, to further improve emotion feature extraction, it proposes a weighted fusion model of the dual nested residual network and a channel-attention residual network (WFDN_CRNet), built from DN-ResNet9 and a fine-tuned channel-attention residual network (CRNet). Finally, to improve the emotion recognition rate on natural speech datasets, it combines the spectrogram with text and investigates effective methods of bimodal feature-level and decision-level fusion. The specific research contents are as follows:

(1) Building on an improved residual structure, this paper proposes the low-complexity DN-ResNet9 model. Because the residual block of a residual network fuses the extracted feature map with the original feature map and can thereby extract more diverse emotional features, this paper proposes a dual nested residual structure: the residual block of ResNet serves as a small-scale residual module, and a layer of large-scale residual connections is embedded around it, yielding fused emotional features that better separate the emotion classes. Tests on the Chinese Academy of Sciences (CASIA) speech emotion dataset and the Berlin Emotional Speech Database (EMO-DB) show average recognition rates of 89.58% and 81.98% respectively, which are 2.08% and 2.78% higher than ResNet18 and 3.75% and 2.70% higher than ResNeXt. However, the model still recognizes poorly those emotions whose spectrogram textures are highly similar, so its recognition rate needs further improvement.
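To make the dual nested structure concrete, the following is a minimal PyTorch sketch, assuming standard 3x3 convolutional basic blocks; the channel count, inner-block depth, and all identifiers are illustrative assumptions rather than the exact DN-ResNet9 configuration.

```python
import torch
import torch.nn as nn

class SmallResBlock(nn.Module):
    """Small-scale residual module: a standard ResNet basic block."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)    # inner (small-scale) skip

class DualNestedResBlock(nn.Module):
    """Several small-scale residual modules wrapped by one large-scale
    skip, so the output fuses the original feature map with features
    extracted at two nesting depths."""
    def __init__(self, channels, num_inner=2):
        super().__init__()
        self.inner = nn.Sequential(
            *[SmallResBlock(channels) for _ in range(num_inner)]
        )

    def forward(self, x):
        return self.inner(x) + x              # outer (large-scale) skip

# Toy usage: a batch of single-channel spectrograms lifted to 32 channels.
stem = nn.Conv2d(1, 32, 3, padding=1)
block = DualNestedResBlock(32)
out = block(stem(torch.randn(4, 1, 128, 128)))
print(out.shape)                              # torch.Size([4, 32, 128, 128])
```

The design point is that the outer skip re-injects the original feature map after several inner residual stages, so both short-range and long-range residuals contribute to the fused emotional features.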
(2) To further reduce the confusion among emotions whose spectrogram textures are highly similar and to improve the speech emotion recognition rate, this paper proposes a fused emotion feature extraction and enhancement algorithm based on WFDN_CRNet. First, building on the spectrogram preprocessing method, a dual-branch feature extraction method is proposed. In the upper branch, an emotional feature enhancement algorithm based on the Guided Filter (GF) enhances the edge emotion information of the spectrogram and enlarges the range of extractable features, and DN-ResNet9 is trained on the enhanced spectrogram to obtain its emotion-enhanced features. In the lower branch, the Local Binary Pattern (LBP) is introduced to process the spectrogram, further strengthening the texture information of the enhanced features, and a channel-attention residual network based on the Efficient Channel Attention (ECA) mechanism is proposed to extract the texture emotion features (a sketch of the ECA module and the fusion step is given after contribution (3) below). The features extracted by the two branches are then weighted and fused to improve the emotional representation, and finally a fully connected layer performs the emotion classification. On the CASIA dataset and EMO-DB, WFDN_CRNet achieves emotion recognition rates of 94.58% and 85.59% respectively, which are 5.00% and 3.61% higher than the algorithm using DN-ResNet9 alone; 1.66% and 0.91% higher than DN-ResNet9 applied to weighted spectrograms processed by the multi-scale guided-filtering feature enhancement algorithm; and 18.33% and 24.33% higher than the CRNet model based on LBP texture feature extraction. These results show that the weighted fused features extracted by WFDN_CRNet strengthen the model's emotional representation ability.

(3) To improve the emotion recognition rate on natural language, this paper proposes a bimodal emotion recognition algorithm combining speech and text. First, the pre-trained GloVe model and a bidirectional long short-term memory network (Bi-LSTM) extract the text emotion features, and DN-ResNet9 extracts the speech emotion features; the text and speech results are then weighted and fused at the decision level, and the emotion is classified. Experiments verify that bimodal emotion recognition outperforms monomodal recognition, and the paper compares four fusion schemes: feature-level concatenation, decision-level concatenation, feature-level weighted fusion, and decision-level weighted fusion. The decision-level weighted fusion used by this model achieves the best recognition performance, with an unweighted recognition rate of 75.49% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. This exceeds both the monomodal text model and the monomodal speech model, and is 2.23%, 2.03%, and 0.27% higher than the bimodal models using feature-level concatenation, decision-level concatenation, and feature-level weighted fusion respectively.
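Referring back to contribution (2), the sketch below illustrates the two pieces named there: an ECA module in its commonly published form (global average pooling followed by a 1-D convolution across the channel axis) and a weighted fusion of the two branch features. The fusion weight alpha and the assumption that both branches emit same-shaped vectors are hypothetical; the paper's exact settings are not stated in this abstract.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling, then a
    1-D convolution across the channel axis to produce channel weights."""
    def __init__(self, k_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                     # x: (N, C, H, W)
        w = self.pool(x)                      # (N, C, 1, 1)
        w = self.conv(w.squeeze(-1).transpose(1, 2))        # (N, 1, C)
        w = self.sigmoid(w.transpose(1, 2).unsqueeze(-1))   # (N, C, 1, 1)
        return x * w                          # reweight each channel

def weighted_fusion(f_enhanced, f_texture, alpha=0.6):
    """Weighted fusion of the two branch features. Assumes both branches
    emit same-shaped vectors; alpha is a hypothetical weight that would
    be tuned on a validation set."""
    return alpha * f_enhanced + (1.0 - alpha) * f_texture

# Toy usage: attention over a feature map, then fusion of pooled vectors.
fmap = ECA()(torch.randn(4, 64, 16, 16))
f1, f2 = torch.randn(4, 256), torch.randn(4, 256)
fused = weighted_fusion(f1, f2)               # (4, 256), fed to the classifier
```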
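For contribution (3), this is a minimal sketch of decision-level weighted fusion, assuming a four-class, IEMOCAP-style setup: a Bi-LSTM text branch over GloVe-style embeddings produces class logits, a speech branch (standing in for DN-ResNet9) produces its own, and the two class posteriors are combined with a scalar weight. The weight w_text, the hidden sizes, and all identifiers are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    """Bi-LSTM over GloVe-style word embeddings -> class logits."""
    def __init__(self, vocab_size, embed_dim=300, hidden=128, n_classes=4):
        super().__init__()
        # In practice the embedding would be initialised from pre-trained
        # GloVe vectors; here it is randomly initialised for brevity.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, tokens):                # tokens: (N, T) word indices
        h, _ = self.lstm(self.embed(tokens))  # (N, T, 2*hidden)
        return self.fc(h.mean(dim=1))         # mean-pool over time -> logits

def decision_level_fusion(text_logits, speech_logits, w_text=0.4):
    """Weighted sum of the two branches' class posteriors. w_text is a
    hypothetical fusion weight, chosen on validation data in practice."""
    p_text = torch.softmax(text_logits, dim=-1)
    p_speech = torch.softmax(speech_logits, dim=-1)
    return w_text * p_text + (1.0 - w_text) * p_speech

# Toy usage with a 4-class, IEMOCAP-style setup.
text_logits = TextBranch(vocab_size=10000)(torch.randint(0, 10000, (4, 20)))
speech_logits = torch.randn(4, 4)             # stand-in for DN-ResNet9 output
pred = decision_level_fusion(text_logits, speech_logits).argmax(dim=-1)
```

Fusing posteriors rather than concatenated features keeps each branch's classifier independent, which matches the abstract's finding that decision-level weighted fusion edges out the feature-level variants.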