
Research On Speech Emotion Recognition Methods Based On Deep Learning

Posted on: 2024-03-29  Degree: Master  Type: Thesis
Country: China  Candidate: R Wang  Full Text: PDF
GTID: 2545306944969869  Subject: Communication engineering
Abstract/Summary:
Speech emotion recognition (SER) refers to the process of analyzing speech collected from a speaker and extracting its emotional characteristics. SER has a wide range of applications, and demand for server-based deployment is growing; emotion-related services have great potential and commercial value in practice. However, practical application scenarios place high demands on multilingual support and recognition accuracy, which complicates real-world deployment. SER technology still faces many difficulties, such as small emotional speech corpora that are difficult to annotate, and interference from semantic and language information with deep learning models, which leads to low recognition accuracy and poor generalization. This thesis therefore aims to improve emotion recognition accuracy and cross-language recognition ability, and carries out the following work:

(1) Differences in the distribution of speech signals across languages and cultures can degrade recognition accuracy across datasets. This thesis proposes an emotion recognition network architecture built on an attention mechanism and a bidirectional long short-term memory (BiLSTM) network; tested on multiple datasets, it significantly improves recognition accuracy. In addition, a method based on local feature alignment is proposed that can train models on small corpora without emotion labels. Compared with traditional feature alignment algorithms, its more effective alignment avoids the negative transfer caused by cross-lingual differences, and it achieves an average improvement of 6.18%.

(2) Modeling emotion with a single acoustic feature is not effective, so this thesis proposes a method that combines acoustic and semantic features to improve recognition accuracy. First, a multimodal baseline built on BERT and AlexNet is constructed to process the semantic and acoustic features; second, late fusion is applied to the concatenated emotion features. The effectiveness of the proposed algorithm is validated on the IEMOCAP dataset, where it achieves an improvement of 4.31% over the baseline.
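The components described above can be summarized concretely. As an illustrative sketch only (the 40 log-Mel input bands, layer widths, and 4-class output are assumptions, not taken from the thesis), an attention-over-BiLSTM classifier of the kind described in (1) can be written in PyTorch as:

import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    def __init__(self, n_mels=40, hidden=128, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # one relevance score per frame
        self.fc = nn.Linear(2 * hidden, n_classes)  # emotion classifier

    def forward(self, x):                           # x: (batch, frames, n_mels)
        h, _ = self.lstm(x)                         # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)      # attention weights over frames
        ctx = (w * h).sum(dim=1)                    # attention-pooled utterance vector
        return self.fc(ctx)                         # emotion logits

logits = BiLSTMAttention()(torch.randn(8, 300, 40))  # 8 utterances, 300 frames each

The abstract does not spell out the local feature alignment objective of (1); as a stand-in, a CORAL-style loss (Sun and Saenko, 2016) shows the general shape of aligning the feature statistics of a labeled source language with an unlabeled target language:

def coral_loss(source, target):
    # source, target: (batch, dim) features from the two language domains
    d = source.size(1)
    cs = torch.cov(source.T)   # source feature covariance
    ct = torch.cov(target.T)   # target feature covariance
    return ((cs - ct) ** 2).sum() / (4 * d * d)

For (2), a minimal sketch of the bimodal baseline concatenates a BERT sentence embedding with an AlexNet spectrogram embedding before a shared classifier; the 256-dimensional audio embedding, the 3-channel image-like spectrogram input, and the bert-base-uncased checkpoint are assumptions for illustration:

from torchvision.models import alexnet
from transformers import BertModel

class BimodalSER(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.cnn = alexnet(weights=None)
        self.cnn.classifier[-1] = nn.Linear(4096, 256)  # spectrogram embedding
        self.head = nn.Linear(768 + 256, n_classes)     # classifier on fused features

    def forward(self, input_ids, attention_mask, spectrogram):
        # spectrogram: (batch, 3, 224, 224) log-Mel image (assumed input format)
        text = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        audio = self.cnn(spectrogram)
        return self.head(torch.cat([text, audio], dim=-1))

The thesis describes late fusion of the two streams; the concatenation-plus-classifier head above is one common reading of fusing the concatenated emotion features, not necessarily the exact architecture used.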
Keywords/Search Tags:Speech Emotion Recognition, Domain Adaptation, Cross-language, Multimodality Integration