| Speech emotion recognition (SER) is an important research direction in the field of speech and emotion computing. The purpose of SER is to detect the emotional state of a speaker, so that machines can automatically recognize people's emotional states from speech signals. SER has a wide range of applications, including human-computer interaction, intelligent customer service, automated call centers, and diagnostic support for patients with depression. Some scholars have found that speech emotion data are inherently ambiguous because emotion is subjective. However, most previous SER methods ignore this ambiguity and treat speech emotion recognition as a simple classification task, which results in poor recognition accuracy. This work takes the ambiguity of speech emotion data fully into account: based on the characteristics of the data, it starts from the emotion labels, uses better-suited algorithms to re-model the speech emotion features, and finally classifies them. Three methods are proposed for ambiguous speech emotion recognition. First, because emotion data contain both simple and difficult samples, a curriculum learning method based on progressive Co-teaching is proposed. Second, because a single utterance contains emotionally silent frames, an ambiguous speech emotion recognition method based on a frame-level temporal convolutional network is proposed, which ignores the silent frames and enhances the information carried by the emotionally active frames. Finally, drawing on the idea of meta-learning, an attention mechanism is used to strengthen the effect of clear data on network training and weaken the effect of ambiguous data, yielding an ambiguous speech emotion recognition algorithm based on meta-learning progressive Co-teaching. The details are as follows: (1) In Chapter 3, an ambiguous speech emotion recognition method based on
progressive Co-teaching (PCT) is proposed. The ambiguity of emotion manifests as a small number of simple samples and a large number of difficult samples in the emotion database. Earlier work proposed applying curriculum learning to ambiguous speech emotion recognition by manually partitioning the database into simple and difficult sample sets and then training the network on the simple set first and the difficult set afterwards, which is time-consuming. The proposed method instead judges the difficulty of each sample automatically from its loss value, so that the features of simple samples are learned first and those of difficult samples are learned later. This curriculum-style training schedule is therefore established automatically, and the PCT method improves the accuracy of ambiguous speech emotion recognition. (2) In Chapter 4, an ambiguous speech emotion recognition method based on a frame-level temporal convolutional network is proposed. Because of its time-domain characteristics, an utterance may contain both emotionally silent periods and emotionally active periods. In addition, a complete utterance may carry emotional information with different probabilities, so a traditional single hard label cannot properly represent ambiguous emotion data. The method proposed in Chapter 4 therefore consists of two parts. The first part proposes an advanced soft label that better redefines the label information of ambiguous speech data and addresses the fact that an utterance contains emotional information with different probabilities. The second part proposes a network that handles the emotionally silent and active frames in speech: a temporal convolutional network is used to better model the speech emotion information, and an attention mechanism is used to ignore the emotionally silent frames and enhance the effect of
the emotionally active frames in speech. In this way, the accuracy of ambiguous speech emotion recognition can be improved. (3) In Chapter 5, an ambiguous speech emotion recognition algorithm based on meta-learning progressive Co-teaching is proposed. Because emotion is subjective, the emotion labels assigned by expert annotators may themselves be ambiguous. During training, this method assigns a weight to each sample through an attention mechanism, so as to amplify the effect of clear data and reduce the influence of ambiguous data on network training. In addition, the method in this chapter combines the advantages of the progressive Co-teaching method of Chapter 3 and the frame-level temporal convolutional network of Chapter 4, improving ambiguous speech emotion recognition from both the data and the network perspectives and thereby increasing recognition accuracy. |
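
The loss-based curriculum summarized in item (1) can be illustrated with a minimal sketch. This is not the implementation from Chapter 3. The function names, the linear schedule, and its `start` and `end` values are hypothetical choices made for illustration, and the peer-exchange step only follows the general Co-teaching idea of two networks that each pick small-loss samples for the other.

```python
from typing import List, Tuple


def keep_ratio(epoch: int, total_epochs: int,
               start: float = 0.3, end: float = 1.0) -> float:
    """Fraction of a batch kept for training at a given epoch.

    Assumed schedule (hypothetical): grow linearly from `start` to `end`,
    so that only low-loss ("simple") samples are used early on and
    high-loss ("difficult") samples are admitted later.
    """
    if total_epochs <= 1:
        return end
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start * (1.0 - progress) + end * progress


def select_small_loss(losses: List[float], ratio: float) -> List[int]:
    """Return the indices of the `ratio` fraction of samples with the smallest loss."""
    n_keep = max(1, int(round(ratio * len(losses))))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])


def co_teaching_selection(losses_a: List[float], losses_b: List[float],
                          ratio: float) -> Tuple[List[int], List[int]]:
    """Each network selects its small-loss samples for its peer to train on.

    Returns (indices network A trains on, indices network B trains on).
    """
    train_b_on = select_small_loss(losses_a, ratio)  # chosen by network A
    train_a_on = select_small_loss(losses_b, ratio)  # chosen by network B
    return train_a_on, train_b_on


if __name__ == "__main__":
    # Per-sample losses of two hypothetical networks on one mini-batch.
    losses_a = [0.9, 0.1, 0.5, 0.2]
    losses_b = [0.2, 0.8, 0.1, 0.7]
    for epoch in range(5):
        r = keep_ratio(epoch, total_epochs=5)
        print(epoch, round(r, 3), co_teaching_selection(losses_a, losses_b, r))
```

In an actual training loop, the per-sample losses would come from the two networks' forward passes on each mini-batch, and each network would be updated only on the indices chosen by its peer.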