Depression has been designated a serious disease by the World Health Organization; the number of diagnosed patients is rising sharply and the age of onset is trending downward, causing serious harm worldwide and increasing the social and medical burden. Recently, much research on multimodal depression recognition has been conducted with machine learning and deep learning methods, using feature types such as video and audio. Multimodal depression recognition is a multidisciplinary topic that spans psychology, medicine, and computer science. In recent years, many advanced deep learning models have been applied to automatic depression recognition: long short-term memory networks (LSTM) are used to analyze sequence data, and convolutional neural networks (CNN) to analyze image and video data. Owing to the Transformer's strong capability for modeling dynamic contextual information, the self-attention mechanism has also attracted broad interest and application. This paper studies automatic depression recognition technology; the research contents are as follows.

(1) For single-modal audio depression recognition, in order to capture both local and global temporal context in the audio, this paper proposes an audio depression recognition method based on Transformer and LSTM. First, low-level raw features of depressed-speech audio clips are extracted from the dataset videos. Then a Transformer extracts global high-level temporal audio features while an LSTM extracts local high-level temporal audio features, and the two are combined by model-level fusion. Finally, a fully connected layer produces the depression assessment. Experiments show that the method performs well on audio-based automatic depression assessment.

(2) For single-modal video
depression recognition, in order to obtain global differential attention features of the video, this paper proposes a video depression recognition model based on differential convolution and a self-attention mechanism. First, 16 frames are sampled from each dataset video as the frame-level raw input of the model. Then a differential convolutional layer extracts deep differential spatio-temporal features, and a self-attention mechanism assigns greater weight to informative features, yielding the final global differential attention features. Finally, a fully connected layer estimates the depression level. Experiments show that the method performs well on video-based automatic depression assessment.

(3) In order to combine audio and video, obtaining global attention features for the audio and multi-scale differential attention features for the video, this paper proposes a multimodal depression recognition model using differential convolution and a Transformer. The method first extracts raw audio and video data from the dataset and is trained end to end. Differential convolution extracts the video features: the differencing operation reduces the uncertainty caused by an inconsistent frame step, and 3D convolution extracts deep spatio-temporal features from the video data. A Transformer extracts the audio features, exploiting its capacity for global context modeling. Finally, an attention mechanism fuses the audio features and video features. Experiments show that the method performs well on audio-video multimodal automatic depression assessment.
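The two-branch audio model of contribution (1) can be sketched as follows. This is a minimal illustrative PyTorch implementation, not the paper's actual architecture: the feature dimension, layer counts, and the mean-pooling / last-hidden-state readouts are assumptions, and the fusion is a simple concatenation of the Transformer (global) and LSTM (local) branch outputs before the fully connected head.

```python
# Hypothetical sketch of the Transformer + LSTM audio model: a Transformer
# branch for global temporal context, an LSTM branch for local temporal
# context, fused at the model (feature) level. Dimensions are illustrative.
import torch
import torch.nn as nn

class AudioDepressionNet(nn.Module):
    def __init__(self, feat_dim=40, d_model=64, n_heads=4, lstm_hidden=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)              # lift raw features to d_model
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)  # global branch
        self.lstm = nn.LSTM(feat_dim, lstm_hidden, batch_first=True)       # local branch
        self.head = nn.Linear(d_model + lstm_hidden, 1)       # depression score regressor

    def forward(self, x):                                     # x: (batch, time, feat_dim)
        g = self.transformer(self.proj(x)).mean(dim=1)        # global features (batch, d_model)
        l, _ = self.lstm(x)
        l = l[:, -1, :]                                       # local features (batch, lstm_hidden)
        fused = torch.cat([g, l], dim=-1)                     # model-level fusion
        return self.head(fused).squeeze(-1)                   # (batch,) predicted scores

model = AudioDepressionNet()
scores = model(torch.randn(2, 100, 40))   # 2 clips, 100 frames, 40-dim features
print(scores.shape)                       # torch.Size([2])
```

Concatenation is only one choice for the model-layer fusion; the key point is that both branches see the same low-level feature sequence and are trained jointly with the shared fully connected head.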
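The differencing operation behind the differential convolution of contributions (2) and (3) can be illustrated with a few lines of NumPy. This is a conceptual sketch, not the paper's layer: it shows only the temporal-difference step applied to sampled frames before 3D convolution, which suppresses static content and the offset introduced by an inconsistent frame step.

```python
# Minimal NumPy sketch of the temporal-difference step assumed to precede
# the 3D convolution: adjacent sampled frames are subtracted so that only
# motion (dynamic) information remains. Shapes are illustrative.
import numpy as np

def temporal_difference(frames):
    """frames: (T, H, W, C) video clip -> (T-1, H, W, C) frame differences."""
    return frames[1:] - frames[:-1]

clip = np.random.rand(16, 112, 112, 3)        # 16 sampled frames, as in the model
diff = temporal_difference(clip)
print(diff.shape)                             # (15, 112, 112, 3)

static = np.ones((16, 8, 8, 1))               # a perfectly static clip...
assert np.allclose(temporal_difference(static), 0)   # ...differences to zero
```

Because a static clip maps to zeros, the downstream 3D convolution spends its capacity on motion-related spatio-temporal patterns rather than appearance alone.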
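The attention-based fusion of contribution (3) can likewise be sketched in NumPy. This is a hypothetical minimal form, assuming each modality has already been reduced to a single feature vector: a learned vector `w` (illustrative, not from the paper) scores each modality, a softmax turns the scores into modality weights, and the fused representation is the weighted sum.

```python
# Illustrative attention fusion of an audio feature vector and a video
# feature vector: score each modality, softmax the scores into weights,
# and return the attention-weighted sum. `w` stands in for learned weights.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())       # shift for numerical stability
    return e / e.sum()

def attention_fuse(audio_feat, video_feat, w):
    feats = np.stack([audio_feat, video_feat])   # (2, d) modality features
    weights = softmax(feats @ w)                 # (2,) attention over modalities
    return weights @ feats, weights              # fused (d,) vector and the weights

audio = np.random.rand(64)
video = np.random.rand(64)
fused, weights = attention_fuse(audio, video, np.random.rand(64))
print(fused.shape)                # (64,); weights sum to 1
```

Unlike plain concatenation, the learned weights let the model lean on whichever modality is more informative for a given sample.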