As social pressure continues to increase, the number of people suffering from depression is also rising, and severely depressed patients may develop suicidal tendencies. Currently, the scale-based doctor interview is the mainstream method for diagnosing depression, but it is easily affected by subjective factors. Moreover, the shortage of professional medical staff stands in sharp contrast to the high incidence of depression. There is therefore an urgent need for solutions that assist doctors in diagnosis. In clinical practice, text semantics, facial expressions, and motion information play a crucial role in clinicians' evaluations. Some scholars have relied on a single modality to build depression-diagnosis models, but single-modal models suffer from limited information and low diagnostic accuracy, whereas multimodal models can improve accuracy by exploiting the complementarity of different modalities. To diagnose depression from multimodal information, we collected a new dataset of 196 subjects and used it to build a multimodal depression-diagnosis model with high accuracy and good performance. The specific work and contributions are as follows.

For text semantic data, we construct a Bi-directional Long Short-Term Memory (BiLSTM) network to diagnose depression. This model makes effective use of context information and reaches a diagnostic accuracy of 80%, higher than support vector machine and naive Bayes models.

For video information, we propose the V-D-W-Transformer model, which improves convergence and accuracy by introducing a window mechanism and a diagonal matrix, and is well suited to modeling long time series. Compared with CNN-LSTM and Video Transformer models, the V-D-W-Transformer converges better and is more accurate.

For multimodal data, we adopt a hierarchical processing scheme and propose the V-D-W-A-Multimodal model. It uses the V-D-W-Transformer for local temporal exploration and a BiLSTM to extract text semantic features, and fuses video, text semantics, demographic characteristics, and video annotation information at the feature layer. An Attention module then aggregates the global features. The model reaches 91.7% accuracy in diagnosing depression and is well suited to processing multimodal information.

Our comprehensive evaluation further demonstrates the effectiveness of multimodal fusion for diagnosing depression and the advantage of our method over traditional approaches.
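To make the text branch concrete, here is a minimal PyTorch sketch of a BiLSTM binary classifier of the kind described above. The vocabulary size, embedding dimension, and hidden dimension are illustrative assumptions, not values reported in this work.

```python
# Minimal sketch of a BiLSTM text classifier for binary (depressed vs.
# not-depressed) prediction. All hyperparameters below are assumptions.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # bidirectional=True reads the transcript in both directions,
        # capturing context on each side of every token
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, 2)  # two classes

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embedding(token_ids)              # (batch, seq_len, embed)
        _, (h_n, _) = self.lstm(x)                 # h_n: (2, batch, hidden)
        # concatenate the final forward and backward hidden states
        h = torch.cat([h_n[0], h_n[1]], dim=-1)    # (batch, 2*hidden)
        return self.fc(h)                          # class logits

logits = BiLSTMClassifier()(torch.randint(1, 10000, (4, 50)))
print(logits.shape)  # torch.Size([4, 2])
```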
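The abstract does not specify how the window mechanism and the diagonal matrix interact inside V-D-W-Transformer; the sketch below implements one plausible reading, in which self-attention over the frame sequence is restricted to a diagonal band (a local temporal window). Dimensions, head count, and window size are assumptions.

```python
# Hedged sketch of windowed self-attention over a long frame sequence:
# a banded (diagonal) boolean matrix limits each frame to a local
# temporal window, one possible reading of the V-D-W mechanism.
import torch
import torch.nn as nn

def banded_mask(seq_len, window):
    """True inside a diagonal band of width 2*window+1."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

class WindowedSelfAttention(nn.Module):
    def __init__(self, dim=64, num_heads=4, window=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window

    def forward(self, frames):                      # (batch, T, dim)
        T = frames.size(1)
        # in nn.MultiheadAttention, True entries are *blocked*,
        # so invert the band to mask everything outside the window
        mask = ~banded_mask(T, self.window).to(frames.device)
        out, _ = self.attn(frames, frames, frames, attn_mask=mask)
        return out

x = torch.randn(2, 128, 64)  # 2 clips, 128 frames, 64-d features
print(WindowedSelfAttention()(x).shape)  # torch.Size([2, 128, 64])
```

For a fixed window size, the number of attended positions per frame is constant, so attention cost grows roughly linearly with sequence length, which is consistent with the claim that the model suits long time series.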
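Finally, a hedged sketch of the described fusion scheme: per-timestep video features are concatenated with text, demographic, and annotation features at the feature layer, and an additive attention module aggregates them into a global representation for classification. All dimensions and the exact attention form are illustrative assumptions.

```python
# Sketch of feature-layer fusion followed by attention pooling, mirroring
# the described V-D-W-A-Multimodal pipeline. Dimensions are assumptions.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, video_dim=64, text_dim=128, demo_dim=8, anno_dim=16,
                 fused_dim=128):
        super().__init__()
        in_dim = video_dim + text_dim + demo_dim + anno_dim
        self.proj = nn.Linear(in_dim, fused_dim)   # feature-layer fusion
        self.score = nn.Linear(fused_dim, 1)       # additive attention score
        self.head = nn.Linear(fused_dim, 2)        # depressed / healthy

    def forward(self, video, text, demo, anno):
        # video: (batch, T, video_dim); the other modalities are
        # clip-level vectors, broadcast across the T video timesteps
        T = video.size(1)
        expand = lambda v: v.unsqueeze(1).expand(-1, T, -1)
        fused = torch.tanh(self.proj(torch.cat(
            [video, expand(text), expand(demo), expand(anno)], dim=-1)))
        w = torch.softmax(self.score(fused), dim=1)  # (batch, T, 1)
        pooled = (w * fused).sum(dim=1)              # global aggregation
        return self.head(pooled)

logits = AttentionFusion()(torch.randn(2, 16, 64), torch.randn(2, 128),
                           torch.randn(2, 8), torch.randn(2, 16))
print(logits.shape)  # torch.Size([2, 2])
```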