
Improving Speech Emotion Recognition Based On Phonological Representation

Posted on: 2019-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: L J Shen
Full Text: PDF
GTID: 2437330548481248
Subject: Education Technology
Abstract/Summary:
Background: Speech is the most natural and efficient modality for human-machine interaction, and it carries the speaker's emotional state. Speech emotion recognition (SER) is therefore a key technique for realizing more natural and more intelligent human-machine interaction. Progress in SER depends on both classifiers and features. In terms of feature selection, most research to date uses only large sets of acoustic features, which shed little light on the relationship between emotion and prosody.

Goal: (1) To improve SER by combining phonological representations with acoustic features using deep learning methods; (2) to explore the relationship between prosody and specific emotions in order to uncover the prosodic patterns characteristic of each emotion.

Method: Our experiments comprise two parts. (1) We improve SER on the public Interactive Emotional Dyadic Motion Capture (IEMOCAP) database by combining acoustic and phonological features under a leave-one-speaker-out cross-validation framework. Experiments are conducted at the utterance level and at the level of clustered acoustic words. A support vector machine, logistic regression, and a convolutional neural network (CNN) are used as classifiers. (2) We analyze the discriminative power of phonological and acoustic features for emotion recognition with logistic regression models.

Results: (1) With phonological representations, the CNN achieves an unweighted average recall (UAR) of 60.02% on categorical emotion recognition, a state-of-the-art result. Compared to a baseline system based on acoustic features alone, the proposed system with combined features improves UAR by 3.1% in categorical emotion classification, and by 4.08%, 3.51%, and 3.9% in activation, dominance, and valence dimensional emotion classification, respectively. (2) In the experiments based on clustered acoustic words with combined acoustic and phonological features, a long short-term memory recurrent neural network performs best on almost all tasks, achieving UARs of 44.98%, 52.94%, 44.79%, and 38.05%, respectively, whereas the plain recurrent neural network performs worst. (3) Pitch accents and break indices are discriminative for dominance and activation: emotions with high activation show more pitch accents and a fluent speaking rate, while emotions with high dominance show more long pauses. (4) Log-Mel frequency bands and loudness are the two most discriminative acoustic features, and loudness is highly predictive of activation.

Conclusions: Because the experiments are conducted on a public database, the results are objective. We identify salient phonological features that help distinguish specific emotions, and the research clarifies the relationship between phonology and emotion. Combining phonological representations with acoustic features will serve as a strong baseline for future SER systems.

Novelty: (1) Exploring the relationship between emotions and prosody and unveiling the prosodic patterns of speech specific to different emotions; (2) improving speech emotion recognition by combining acoustic features with phonological representations; (3) proposing a new approach to emotion recognition based on acoustic words, inspired by the application of deep learning in natural language processing.
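For illustration only, below is a minimal sketch of the leave-one-speaker-out evaluation with combined features, written with scikit-learn. The random placeholder data, the feature dimensions, and the use of logistic regression (standing in for the thesis's full set of SVM, logistic regression, and CNN classifiers) are assumptions of the sketch, not the thesis's actual pipeline.

    # Sketch: leave-one-speaker-out SER with concatenated acoustic and
    # phonological features. All data below is random placeholder data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import recall_score
    from sklearn.model_selection import LeaveOneGroupOut
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    n_utt = 500                                  # placeholder utterance count
    acoustic = rng.normal(size=(n_utt, 88))      # e.g., acoustic functionals (assumed dim)
    phonological = rng.random(size=(n_utt, 20))  # e.g., phonological posteriors (assumed dim)
    labels = rng.integers(0, 4, size=n_utt)      # 4 emotion categories
    speakers = rng.integers(0, 10, size=n_utt)   # 10 speakers -> 10 CV folds

    # Combine the two feature views by simple concatenation.
    X = np.hstack([acoustic, phonological])

    # Leave-one-speaker-out cross-validation: each fold holds out one speaker.
    uars = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, labels, groups=speakers):
        clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        clf.fit(X[train_idx], labels[train_idx])
        pred = clf.predict(X[test_idx])
        # Unweighted average recall (UAR) = recall macro-averaged over classes.
        uars.append(recall_score(labels[test_idx], pred, average="macro"))

    print(f"Mean UAR over speaker folds: {np.mean(uars):.4f}")

UAR (macro-averaged recall) is the metric cited throughout the abstract; concatenating the two feature views before a shared classifier mirrors the combined-feature setup described above.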
Keywords/Search Tags:speech emotion recognition, acoustic features, phonology, feature analysis, deep learning