| Language is an important means of human communication: it not only transmits information but also expresses emotion. The same utterance often carries different meanings in different emotional contexts, a phenomenon common to languages around the world. Speech emotion recognition (SER), as an important branch of speech recognition, therefore has significant research value. The speech signal is one of the most basic and important modalities in the field of affective computing in artificial intelligence. For the SER task, researchers at home and abroad usually either process acoustic features extracted from the speech signal directly or convert the signal into spectrograms for recognition. SER is also of great practical significance, and a variety of applications have been derived from it, such as assistive machines for people with speech impairments and service machines that sense customer emotions.

This paper conducts in-depth research on bilingual speech emotion recognition, proposes a bilingual SER model based on an autoencoder + LSTM and a bilingual SER model based on a bilinear capsule network, carries out experiments with both models, compares them with current mainstream SER methods, and draws conclusions and prospects for future work. The specific contributions of this paper are:

(1) A bilingual speech emotion recognition model based on an autoencoder + LSTM is proposed. The model feeds the original speech data into an autoencoder to extract deep, abstract speech emotion features, further processes these deep features with an LSTM network, and achieves good recognition results on a mixed corpus of the German EMO-DB and Chinese CASIA databases.

(2) A bilingual speech emotion recognition model based on a fusion-feature bilinear capsule network is proposed. This paper applies the capsule network, which has become popular in image recognition in recent years, to the exploration of SER. To address the differences between the two languages, a method is proposed that fuses the mel spectrogram with a frame-statistics feature map; the capsule network is also improved by using a bilinear convolution kernel to further extract the texture features of the fused image. This model further improves the recognition rate on the EMO-DB and CASIA corpora and offers a new perspective connecting speech and image recognition.

A large number of comparative experiments were carried out for both models. The experimental results show that the deep features extracted by the autoencoder, as well as the fused mel-spectrogram and frame-statistics features, outperform the original speech features, and that both models achieve excellent performance on bilingual speech emotion recognition, with a recognition rate greatly improved over traditional models. |
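The autoencoder + LSTM pipeline described above can be sketched roughly as follows. This is a minimal illustrative sketch in PyTorch, not the thesis's exact architecture: the feature dimension (39, e.g. MFCCs with deltas), code size, hidden size, and the 7-class output are assumed values chosen for illustration.

```python
# Hedged sketch: per-frame autoencoder extracts deep features,
# an LSTM then classifies the feature sequence per utterance.
# All layer sizes and the class count are illustrative assumptions.
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Compresses per-frame acoustic features into deep emotion features."""
    def __init__(self, in_dim=39, code_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x):
        code = self.encoder(x)          # deep abstract features
        return self.decoder(code), code # reconstruction drives training

class EmotionLSTM(nn.Module):
    """Classifies a sequence of deep features into emotion categories."""
    def __init__(self, code_dim=16, hidden=32, n_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(code_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, codes):
        _, (h, _) = self.lstm(codes)    # h: (num_layers, batch, hidden)
        return self.head(h[-1])         # one logit vector per utterance

# Toy forward pass: batch of 4 utterances, 100 frames, 39-dim features.
frames = torch.randn(4, 100, 39)
ae, clf = FrameAutoencoder(), EmotionLSTM()
recon, codes = ae(frames)   # autoencoder would be trained on reconstruction loss
logits = clf(codes)         # LSTM consumes the extracted deep features
```

In practice the autoencoder would first be trained on reconstruction loss over the mixed bilingual corpus, and the LSTM classifier would then be trained on the encoder's outputs with a cross-entropy loss.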