
Research On Speech Emotion Recognition Methods Based On Deep Learning

Posted on: 2024-03-29  Degree: Master  Type: Thesis
Country: China  Candidate: R Wang  Full Text: PDF
GTID: 2545306944969869  Subject: Communication engineering
Abstract/Summary:
Speech emotion recognition (SER) refers to the process of analyzing speech collected from a speaker and extracting its emotional characteristics. SER has a wide range of applications, and demand for server-based deployment is growing; emotion-related services have great potential and commercial value in practice. However, practical application scenarios place high demands on multilingual support and recognition accuracy, which complicates real-world deployment. SER technology still faces many difficulties, such as small emotional speech corpora that are difficult to annotate, and interference from semantic and language information with deep learning models, which leads to low recognition accuracy and poor generalization. This thesis therefore aims to improve emotion recognition accuracy and cross-language recognition ability, and carries out the following work:

(1) Differences in the distribution of speech signals across languages and cultures can degrade recognition accuracy across datasets. This thesis proposes an emotion recognition network architecture built on an attention mechanism and a bidirectional long short-term memory (BiLSTM) network; tested on multiple datasets, it significantly improves recognition accuracy. In addition, a method based on local feature alignment is proposed that can train models on small corpora without emotion labels. Compared with traditional feature alignment algorithms, its more effective alignment avoids the negative transfer caused by cross-lingual differences, and it achieves an average improvement of 6.18%.

(2) Modeling emotion with a single acoustic feature is not effective, so this thesis proposes a method that combines acoustic and semantic features to improve recognition accuracy. First, a multimodal baseline built on BERT and AlexNet is constructed to process the semantic and acoustic features; second, late fusion is applied to the concatenated emotion features. The effectiveness of the proposed algorithm is validated on the IEMOCAP dataset, where it achieves an improvement of 4.31% over the baseline.
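The components described above can be summarized concretely. As an illustrative sketch only (the 40 log-Mel input bands, layer widths, and 4-class output are assumptions, not taken from the thesis), an attention-over-BiLSTM classifier of the kind described in (1) can be written in PyTorch as:

import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    def __init__(self, n_mels=40, hidden=128, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # one relevance score per frame
        self.fc = nn.Linear(2 * hidden, n_classes)  # emotion classifier

    def forward(self, x):                           # x: (batch, frames, n_mels)
        h, _ = self.lstm(x)                         # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)      # attention weights over frames
        ctx = (w * h).sum(dim=1)                    # attention-pooled utterance vector
        return self.fc(ctx)                         # emotion logits

logits = BiLSTMAttention()(torch.randn(8, 300, 40))  # 8 utterances, 300 frames each

The abstract does not spell out the local feature alignment objective of (1); as a stand-in, a CORAL-style loss (Sun and Saenko, 2016) shows the general shape of aligning the feature statistics of a labeled source language with an unlabeled target language:

def coral_loss(source, target):
    # source, target: (batch, dim) features from the two language domains
    d = source.size(1)
    cs = torch.cov(source.T)   # source feature covariance
    ct = torch.cov(target.T)   # target feature covariance
    return ((cs - ct) ** 2).sum() / (4 * d * d)

For (2), a minimal sketch of the bimodal baseline concatenates a BERT sentence embedding with an AlexNet spectrogram embedding before a shared classifier; the 256-dimensional audio embedding, the 3-channel image-like spectrogram input, and the bert-base-uncased checkpoint are assumptions for illustration:

from torchvision.models import alexnet
from transformers import BertModel

class BimodalSER(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.cnn = alexnet(weights=None)
        self.cnn.classifier[-1] = nn.Linear(4096, 256)  # spectrogram embedding
        self.head = nn.Linear(768 + 256, n_classes)     # classifier on fused features

    def forward(self, input_ids, attention_mask, spectrogram):
        # spectrogram: (batch, 3, 224, 224) log-Mel image (assumed input format)
        text = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        audio = self.cnn(spectrogram)
        return self.head(torch.cat([text, audio], dim=-1))

The thesis describes late fusion of the two streams; the concatenation-plus-classifier head above is one common reading of fusing the concatenated emotion features, not necessarily the exact architecture used.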
Keywords/Search Tags:Speech Emotion Recognition, Domain Adaptation, Cross-language, Multimodality Integration