
Speech Emotion Recognition With Multitask Learning

Posted on: 2024-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Li
Full Text: PDF
GTID: 2568307118479134
Subject: Electronic information
Abstract/Summary:
Speech emotion recognition is an important part of speaker intent analysis, so accurate recognition of emotion in speech has significant research value for intelligent human-computer interaction. Traditional machine learning methods are limited by their feature extraction and model learning capabilities, which makes it difficult for them to process complex speech signals. Deep learning overcomes these problems: multi-layer neural networks can learn from and model speech signals end to end. This thesis studies speech emotion recognition based on attention frame pooling and multi-task learning using deep learning models. The specific contributions are as follows.

To address the difficulty of designing speech emotion features and the uneven distribution of emotion across speech time frames, a speech emotion recognition model based on attention frame pooling is constructed. First, wav2vec 2.0 is used as the feature extractor to obtain frame-level representations of the speech. Multi-head Self-Attention (MHSA) and Structured Self-Attention (SSA) pooling are then applied to aggregate these frame-level representations into a sentence-level representation vector, which is fed into a fully connected neural network for classification. Model performance is analyzed with two metrics, Weighted Accuracy (WA) and Unweighted Accuracy (UA). Experiments on the IEMOCAP dataset show that the SSA-based pooling strategy retains the most emotional information across the speech time frames, reaching 73.2% WA and 74.5% UA, outperforming previous methods and providing a new direction for speech emotion recognition research.

To address the problem that single-task learning models pay insufficient attention to the acoustic emotion information in speech, a speech emotion recognition method based on multi-task learning is proposed. Phoneme recognition, which is related to emotion, serves as the auxiliary task, and a shared representation for the two tasks is learned through multi-task learning so that it captures acoustic emotion information useful for emotion recognition. Compared with the single-task model, the multi-task model improves WA and UA by about 3.6% and 3%, respectively. The multi-task structure significantly improves the network's extraction of emotional information from speech.

Based on the speech emotion recognition methods studied in this thesis, a speech emotion recognition application is designed using PyQt. Its visual interface comprises four modules: user login, data preprocessing, online recording, and emotion recognition. The software supports recording, speech preprocessing, and emotion recognition with the multi-task learning model; it can effectively help human-computer interaction researchers carry out speech emotion recognition work and has significant engineering application value.

The thesis has 31 figures, 11 tables, and 92 references.
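The abstract does not spell out the pooling implementation, so the following PyTorch sketch shows one plausible reading of the attention frame pooling stage, in the spirit of the structured self-attention of Lin et al. (2017). The class names, dimensions (768 is the hidden size of the wav2vec 2.0 base model), and the four-class emotion set common for IEMOCAP are illustrative assumptions, not details taken from the thesis.

```python
import torch
import torch.nn as nn

class StructuredSelfAttentionPooling(nn.Module):
    """Pools frame-level features (B, T, D) into a sentence-level vector.

    Each of several attention heads produces a weighting over time frames;
    the per-head weighted sums are concatenated into one utterance vector.
    """
    def __init__(self, feat_dim: int, attn_dim: int = 128, n_heads: int = 4):
        super().__init__()
        self.w1 = nn.Linear(feat_dim, attn_dim, bias=False)
        self.w2 = nn.Linear(attn_dim, n_heads, bias=False)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim), e.g. wav2vec 2.0 last hidden states
        scores = self.w2(torch.tanh(self.w1(frames)))   # (B, T, n_heads)
        weights = torch.softmax(scores, dim=1)          # attention over time
        pooled = torch.einsum("bth,btd->bhd", weights, frames)
        return pooled.flatten(start_dim=1)              # (B, n_heads * feat_dim)

class EmotionClassifier(nn.Module):
    """wav2vec 2.0 frame features -> SSA pooling -> fully connected classifier."""
    def __init__(self, feat_dim: int = 768, n_emotions: int = 4, n_heads: int = 4):
        super().__init__()
        self.pool = StructuredSelfAttentionPooling(feat_dim, n_heads=n_heads)
        self.head = nn.Sequential(
            nn.Linear(feat_dim * n_heads, 256), nn.ReLU(), nn.Linear(256, n_emotions)
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.pool(frames))
```

In practice the frame-level input here would be the `last_hidden_state` of a pretrained wav2vec 2.0 model (e.g. via the `transformers` library), with MHSA layers optionally refining the frames before pooling.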
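WA and UA are the standard pair of metrics in speech emotion recognition: WA is plain accuracy over all utterances, while UA averages per-class recall so minority emotion classes count equally. A minimal NumPy sketch of both:

```python
import numpy as np

def weighted_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """WA: fraction of all utterances classified correctly."""
    return float(np.mean(y_true == y_pred))

def unweighted_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """UA: per-class recall averaged over classes, so rare emotions weigh equally."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))
```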
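The abstract likewise does not specify how the phoneme recognition auxiliary task is attached to the shared representation. A common design, sketched below under that assumption, adds a frame-level CTC phoneme head alongside the utterance-level emotion head and sums the two losses; the weight `alpha` is a hypothetical hyperparameter, not a value from the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSER(nn.Module):
    """Shared frame-level features feed two heads: utterance-level emotion
    classification and frame-level CTC phoneme recognition (auxiliary task)."""
    def __init__(self, feat_dim: int = 768, n_emotions: int = 4,
                 n_phonemes: int = 40, alpha: float = 0.3):
        super().__init__()
        self.emotion_head = nn.Linear(feat_dim, n_emotions)
        self.phoneme_head = nn.Linear(feat_dim, n_phonemes + 1)  # +1 = CTC blank
        self.ctc = nn.CTCLoss(blank=n_phonemes, zero_infinity=True)
        self.alpha = alpha  # auxiliary-loss weight (assumed hyperparameter)

    def forward(self, frames, emo_labels, phonemes, frame_lens, phoneme_lens):
        # frames: shared representation (B, T, D); the thesis model would pool
        # with SSA attention -- plain mean pooling keeps this sketch self-contained.
        emo_logits = self.emotion_head(frames.mean(dim=1))
        emo_loss = F.cross_entropy(emo_logits, emo_labels)
        log_probs = self.phoneme_head(frames).log_softmax(-1).transpose(0, 1)  # (T, B, C)
        ctc_loss = self.ctc(log_probs, phonemes, frame_lens, phoneme_lens)
        # joint objective: emotion loss plus weighted phoneme (auxiliary) loss
        return emo_loss + self.alpha * ctc_loss, emo_logits
```

Because the gradients of both losses flow into the shared encoder, the learned frame representation is pushed to retain the phonetic detail that, per the thesis, carries acoustic emotion information.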
Keywords/Search Tags:Speech emotion recognition, Multi-task learning, Self-supervised model, Attention frame pooling