| With the rapid development of artificial intelligence and the wide application of deep learning, more and more smart devices and applications are being developed and used. Humanized, intelligent human-computer interaction has become an urgent need, and people increasingly look forward to the future of human-computer interaction. Emotion recognition is one of the foundations of intelligent human-computer interaction: it helps computers recognize human emotions and feelings and thereby better understand human needs and intentions. Recognizing emotion from speech is an important prerequisite for natural human-computer interaction. In this thesis, deep learning methods are used to study recognition models, multimodal emotion, and interactive applications in speech emotion recognition, and the proposed methods effectively improve the recognition rate of speech emotion. The main research contents are as follows: 1. To address the weakness of convolutional neural networks (CNNs) in processing time series, a CNN speech emotion recognition method with forward and backward temporal perception is proposed. Mel-frequency cepstral coefficients (MFCCs), a frequency-domain feature of speech, are extracted as the input features. A time-aware module (38) is designed using dilated causal convolution in a 1D-CNN, and a bidirectional time-aware network is built from this module and a 1D&2D-CNN. The left and middle channels use n time-aware modules to capture the emotional information hidden in the frames along the forward and backward temporal directions, where the value of n affects the accuracy of the model; the right channel is a multi-scale 2D-CNN that identifies global features. Comparative experiments on the EMODB and RAVDESS datasets, evaluated by accuracy, precision, recall, and F1 score, show that the proposed model recognizes emotion more effectively than other published results. 2. To address the low recognition rate of unimodal features, an attention-based bimodal speech emotion recognition method is proposed, namely the speech-text model MCNN-BiLSTM-ATTENTION. First, the Mel spectrogram of the audio and the word embedding vectors of the text are used as the input features, and data augmentation is employed to reduce the impact of dataset imbalance. For audio, a CNN serves as the base framework; multi-scale convolution (MCNN) with multiple dimensions is added on top of it, and efficient channel attention (ECA) is used to increase accuracy. For text, a bidirectional LSTM (BiLSTM) model is used, and a self-attention mechanism is applied to its output sequence to increase the weight of important emotional words and so improve the recognition accuracy for text. Comparative experiments on the IEMOCAP and MELD datasets, evaluated by accuracy, precision, recall, and F1 score, test unimodal recognition, and ablation experiments are carried out on each part of the model. The results show that MCNN-BiLSTM-ATTENTION achieves a higher recognition rate than other methods. 3. To verify the practicality of the proposed methods, a speech emotion recognition terminal is designed and developed with the high-performance embedded device Jetson Xavier NX as the main controller, and a user-friendly human-machine interface is provided. PyQt5 in Python is used as the GUI framework under Linux; the interface software is designed and developed in detail based on the proposed algorithms, and the corresponding flowchart is given. The terminal supports recording, playback, emotion recognition, and speech feature display. Tests of the terminal show that it achieves the desired effect. In summary, this thesis proposes two novel speech emotion recognition methods that address the poor handling of time series by CNNs and the low recognition rate of unimodal features in existing speech emotion recognition methods. The corresponding speech emotion recognition terminal is designed and developed to verify the effectiveness of the proposed methods, which can be widely used in speech emotion detection, mental health treatment, intelligent customer service, and similar applications, and have good theoretical and practical value. |
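
The dilated causal convolution underlying the time-aware module of the first method can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the single feature track, and the fixed kernel are assumptions made for illustration. The backward direction is obtained here by reversing the frame order, which is one common way to realize a bidirectional causal network.

```python
from typing import List, Sequence, Tuple


def dilated_causal_conv1d(
    x: Sequence[float], kernel: Sequence[float], dilation: int = 1
) -> List[float]:
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Frames before the start of the sequence are treated as zeros, so
    y[t] never depends on frames later than t (causality).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def bidirectional_causal_conv1d(
    x: Sequence[float], kernel: Sequence[float], dilation: int = 1
) -> Tuple[List[float], List[float]]:
    """Apply the same causal filter along both temporal directions.

    The backward branch reverses the frames, filters them causally,
    and reverses the result so both outputs are aligned in time.
    """
    forward = dilated_causal_conv1d(x, kernel, dilation)
    backward_reversed = dilated_causal_conv1d(list(reversed(x)), kernel, dilation)
    backward = list(reversed(backward_reversed))
    return forward, backward
```

With one MFCC coefficient track as `x`, the forward output at frame t summarizes the current and earlier frames and the backward output summarizes the current and later frames. Stacking n such layers with growing dilation widens the receptive field without adding parameters per layer.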
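
The efficient channel attention (ECA) step used in the audio branch of the second method can be sketched in the same spirit. The sketch assumes the usual ECA formulation: global average pooling per channel, a small 1D convolution across neighbouring channels, a sigmoid gate, and channel-wise rescaling. The function name `eca_rescale` and the hand-supplied `conv_weights` are hypothetical; in a trained model these weights are learned.

```python
import math
from typing import List, Sequence


def eca_rescale(
    feature_maps: Sequence[Sequence[float]], conv_weights: Sequence[float]
) -> List[List[float]]:
    """Rescale each channel by an attention gate computed from its neighbours.

    feature_maps: one flattened feature map per channel.
    conv_weights: odd-length 1D kernel applied across the channel axis
                  (assumed given here; learned in practice).
    """
    if len(conv_weights) % 2 != 1:
        raise ValueError("conv_weights must have odd length")
    pad = len(conv_weights) // 2

    # Global average pooling: one descriptor per channel.
    pooled = [sum(ch) / len(ch) for ch in feature_maps]
    n = len(pooled)

    # 1D convolution across channels with zero padding, then a sigmoid gate.
    gates = []
    for c in range(n):
        acc = 0.0
        for j, w in enumerate(conv_weights):
            idx = c + j - pad
            if 0 <= idx < n:
                acc += w * pooled[idx]
        gates.append(1.0 / (1.0 + math.exp(-acc)))

    # Channel-wise rescaling of the original feature maps.
    return [[g * v for v in ch] for g, ch in zip(gates, feature_maps)]
```

Because the gate for each channel depends only on a few neighbouring channel descriptors, the attention adds just `len(conv_weights)` parameters, which is what makes the mechanism lightweight.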