
Research On Depression Recognition And Depression Severity Estimation From Audio, Video And Text Information

Posted on: 2021-01-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Yang
Full Text: PDF
GTID: 1524307316495894
Subject: Computer Science and Technology
Abstract/Summary:
Depression has become a high-incidence mental disorder that places a heavy burden on people's work and life. At present, the growing number of depression patients stands in sharp contrast to the serious shortage of medical resources, so an automatic depression assessment and diagnosis system is urgently needed to provide early warning and to relieve the pressure on clinicians. This dissertation studies depression recognition and depression severity estimation based on audio, video, and text features and on multi-modal models. All methods are evaluated on the Audio/Visual Emotion Challenge (AVEC) depression databases. The main contributions of this dissertation are as follows:

1. A hybrid audio-video-text multi-modal model for depression severity estimation and depression recognition, based on Support Vector Regression (SVR) and decision trees, is proposed. First, audio and video features are extracted, and Local Linear Regression (LLR) combined with SVR is used to estimate the PHQ-8 depression severity score (a minimal SVR sketch appears after contribution 2 below). Then, from the interview transcripts, the sleep status, recent feelings, personality, and other life-status information of male and female patients, together with the distribution of PHQ-8 scores, are analyzed. On this basis, decision-tree depression recognition models are designed for male and female patients separately. On the test set of the Audio/Visual Emotion Challenge 2016 (AVEC2016) depression database, the approach reaches an F1 score of 0.724, the best performance in the AVEC2016 depression challenge.

2. For multi-modal depression analysis, new visual and text features and three fusion frameworks are proposed. For the visual modality, a Histogram of Displacement Range (HDR) over facial landmarks is introduced to describe the facial motion retardation typical of depression (a toy implementation is sketched below). For the text modality, Paragraph Vector (PV) and Support Vector Machine (SVM) are used to automatically classify sleep status, feelings, and other life-status items. For fusion, three audio-video-text frameworks are proposed: 1) Triplet Deep Models for Depression Estimation (TriDep-E): audio, video, and text features are fed into Deep Convolutional Neural Network (DCNN) and Deep Neural Network (DNN) models to obtain single-modality severity estimates, which are then combined by decision fusion. 2) Integrated Deep and Shallow Models for Depression Recognition (InDepS-R): audio-video features are first used to estimate the depression severity level through DCNN and DNN models, while life status is classified from text by PV and SVM; the severity level and the life-status classifications are then concatenated and fed into a Random Forest (RF) for the final depression recognition. 3) Integrated Deep and Shallow Models for Depression Estimation (InDepS-E): audio-video severity estimation models are trained separately for depressed and non-depressed samples, a depression/non-depression decision is obtained from the text modality, and the two are fused by multivariate regression into the final severity estimate. InDepS-R achieves the best depression recognition result on the AVEC2016 test set, with an average F1 score of 0.746. InDepS-E achieves a Root Mean Square Error (RMSE) of 5.400 and a Mean Absolute Error (MAE) of 4.359 on the test set of the AVEC2017 depression database, lower than most state-of-the-art depression estimation methods.
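As a concrete illustration of the SVR regressor mentioned in contribution 1, the following is a minimal sketch assuming pre-extracted fixed-length audio-video feature vectors. The arrays and hyperparameters are hypothetical placeholders, not the dissertation's actual AVEC pipeline (which also involves Local Linear Regression).

```python
# Hedged sketch: RBF-kernel SVR predicting PHQ-8 scores from feature vectors.
# X_train / y_train / X_test are synthetic stand-ins for real AVEC features.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 64))    # e.g. audio-video feature vectors
y_train = rng.uniform(0, 24, size=100)  # PHQ-8 total scores lie in [0, 24]
X_test = rng.normal(size=(10, 64))

# Standardize features, then fit an RBF SVR (one plausible realization)
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X_train, y_train)
phq8_pred = np.clip(model.predict(X_test), 0, 24)  # keep scores in range
print(phq8_pred)
```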
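One plausible reading of the HDR descriptor from contribution 2 is sketched below: the range of each landmark's frame-to-frame displacement over a clip is binned into a normalized histogram, so a retarded (low-motion) face concentrates mass in the low-range bins. The bin count, range cap, and use of Euclidean displacement are assumptions; the abstract does not fix these details.

```python
# Hedged sketch of a Histogram of Displacement Range (HDR) over landmarks.
import numpy as np

def hdr_feature(landmarks, n_bins=10, max_range=20.0):
    """landmarks: array (T, L, 2) of L 2-D landmark positions over T frames."""
    # Per-frame displacement magnitude of each landmark vs. the previous frame
    disp = np.linalg.norm(np.diff(landmarks, axis=0), axis=-1)  # (T-1, L)
    # Range (max - min) of displacement per landmark across the clip
    range_per_lm = disp.max(axis=0) - disp.min(axis=0)          # (L,)
    hist, _ = np.histogram(range_per_lm, bins=n_bins, range=(0.0, max_range))
    return hist / max(hist.sum(), 1)  # normalized histogram as the feature

T, L = 120, 68  # e.g. 68 dlib-style landmarks over 120 frames (assumed)
demo = np.random.default_rng(1).normal(size=(T, L, 2)).cumsum(axis=0)
print(hdr_feature(demo))
```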
3. To alleviate the limited amount of annotated depression data, a two-level hierarchical Deep Convolutional Generative Adversarial Network (two-level hierarchical DCGAN) is proposed to augment speech features and thereby expand the depression training set. To measure the quality of the augmented speech features, three criteria are proposed, characterizing image entropy, frequency content, and a deep-learning-based perspective. The augmented features are used to train DCNN severity estimation models. The DCGAN-based augmentation effectively improves depression severity estimation, reducing the RMSE to 5.520 and the MAE to 4.634 on the AVEC2017 depression database, lower than most state-of-the-art audio-based severity estimation methods.

4. For depression severity estimation from facial information, a model combining spatial and temporal attention over face patches is proposed to describe the emotional expression of different facial regions in different frames: 1) Inspired by the experience of human FACS coders, FACS3D-Net integrates 3D and 2D CNNs to simultaneously encode spatial and temporal facial representations for Action Unit (AU) detection; it achieves an average F1 score of 59.71% on the EB+ (Expanded BP4D+) AU database, outperforming 2D CNN and 2D CNN-LSTM models. 2) Considering the regional nature of AUs, a dynamic patch-attentive deep network for AU detection (D-PAttNet) is proposed; its spatial sigmoidal attention mechanism allows multiple static and dynamic patch encodings to contribute to the prediction of specific AUs (a toy sketch of this gating follows contribution 5 below). D-PAttNet obtains state-of-the-art performance, with a 64.7% F1 score, on the Binghamton-Pittsburgh 3D Dynamic Spontaneous Facial Expression Database (BP4D). 3) Building on D-PAttNet, a temporal attention module is added to form the Multi-Attentive Dynamic Patch Network (MultiAtt-DPNet) for depression severity estimation; it not only focuses the model on emotionally salient facial areas but also learns which frames are more salient and automatically assigns them higher weights. MultiAtt-DPNet achieves an RMSE of 9.190 and an MAE of 7.408 on the AVEC2014 depression database, lower than most state-of-the-art depression estimation methods.

5. To distinguish the three episodes of bipolar disorder (mania, hypomania, and remission), an audio-based arousal histogram feature is proposed, and a multi-modal ensemble learning framework based on DNN and RF is proposed to fuse the audio and visual modalities (a minimal fusion sketch appears below). On the AVEC2018 bipolar disorder challenge, the Unweighted Average Recall (UAR) reaches 0.574, a state-of-the-art result.
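The sigmoidal spatial patch attention of D-PAttNet (contribution 4) can be illustrated with the hedged PyTorch sketch below: each patch encoding receives an independent sigmoid gate rather than a softmax weight, so several patches can contribute strongly to the same AU at once. The patch count, encoding dimension, and linear AU classifier are illustrative assumptions, not the published architecture.

```python
# Hedged sketch: per-patch sigmoid gating in the spirit of D-PAttNet.
import torch
import torch.nn as nn

class SigmoidPatchAttention(nn.Module):
    def __init__(self, n_patches=9, dim=128, n_aus=12):
        super().__init__()
        self.gate = nn.Linear(dim, 1)             # per-patch attention score
        self.classifier = nn.Linear(dim, n_aus)   # AU occurrence logits

    def forward(self, patch_enc):
        # patch_enc: (batch, n_patches, dim) patch encodings from some CNN
        a = torch.sigmoid(self.gate(patch_enc))   # (B, P, 1) independent gates
        # Gate-weighted pooling over patches (normalized by total gate mass)
        pooled = (a * patch_enc).sum(dim=1) / a.sum(dim=1).clamp(min=1e-6)
        return self.classifier(pooled)            # (B, n_aus)

x = torch.randn(4, 9, 128)                        # toy batch of patch encodings
print(SigmoidPatchAttention()(x).shape)           # torch.Size([4, 12])
```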
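For the DNN-and-RF ensemble of contribution 5, the sketch below uses simple probability averaging over hypothetical audio and video features (X_audio, X_video); the dissertation's actual fusion scheme, feature definitions, and model sizes may well differ.

```python
# Hedged sketch: DNN + Random Forest ensemble over audio-visual features
# for three-way bipolar episode classification, fused by averaging.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X_audio = rng.normal(size=(90, 32))   # e.g. arousal-histogram features (assumed)
X_video = rng.normal(size=(90, 48))   # e.g. visual features (assumed)
y = rng.integers(0, 3, size=90)       # mania / hypomania / remission labels

dnn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X_audio, y)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_video, y)

# Decision-level fusion: average the class-probability outputs
proba = (dnn.predict_proba(X_audio) + rf.predict_proba(X_video)) / 2
print(proba.argmax(axis=1)[:10])      # fused episode predictions
```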
Keywords/Search Tags:Multi-modal Depression Recognition, Multi-modal Depression Severity Estimation, Deep Convolutional Generative Adversarial Network, End-to-End Multi-attentive Dynamic Patch Network, Bipolar Disorder Detection