As one of the most prevalent mental illnesses today, depression not only causes a person's mental state to deteriorate, but can also impair productivity at work and even induce suicidal behavior. To improve the efficiency of diagnosing and treating depression, automatic depression detection has become an active research topic in affective computing. However, previous methods often ignore the dialogue structure of interview transcripts and their compatibility with other modalities, and the depression labels in these datasets are often imbalanced, which limits the effectiveness of automatic depression detection algorithms. To this end, this dissertation improves the generalization performance of such algorithms from the following perspectives.

Firstly, using variational inference and a re-weighting strategy, we propose a joint solution to the problems of inconsistent context and imbalanced regression in depression detection, and we link the two by introducing depression symptom labels. Specifically, we propose a depression detection model called the Conditional Variational Topic-Enriched Auto-encoder, which captures spatial features from local topic information via variational inference and captures temporal information from global contextual information via an attention mechanism. We also use topic information to enrich the local and global information simultaneously, yielding spatio-temporal features with stronger representational power and resolving the problem of inconsistent context. In addition, we apply a re-weighting strategy that redistributes the training-loss weights across different depression labels, discouraging the model from overfitting to samples with common labels; by exploiting the linear relationship between depression symptom labels and depression labels, we can also learn to predict missing values in the depression labels, which addresses the imbalanced regression problem.

Secondly, multimodal factorization is introduced to simultaneously learn intra-modal and cross-modal relationships between the visual and textual modalities. More specifically, by combining multi-task learning and variational inference, we propose a multimodal factorized auto-encoder architecture that takes visual and textual inputs for depression detection, allowing the two modalities to learn complementary depression cues. At the beginning, we use a backbone network to extract visual and textual features separately; at the same time, we introduce a memory fusion network to obtain cross-modal features and initially learn the complementary information between modalities. Then, we employ multimodal factorization to establish the link between the uni-modal and multimodal latent-variable spaces, which is crucial for eliminating redundant information across modalities.

Finally, focusing on the dialogue content of depression interview transcripts, we design comparative experiments on Chinese and English data to demonstrate the effectiveness and robustness of the proposed algorithms.
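To make the variational component of the first contribution concrete, the following is a minimal PyTorch sketch of how an utterance feature and a topic vector might be encoded into a latent Gaussian via the reparameterization trick. The class name `TopicEnrichedEncoder`, the layer sizes, and concatenation as the topic-injection mechanism are illustrative assumptions, not the dissertation's exact architecture.

```python
import torch
import torch.nn as nn

class TopicEnrichedEncoder(nn.Module):
    """Encode an utterance feature together with a topic vector into a
    latent Gaussian via the reparameterization trick (illustrative sizes)."""
    def __init__(self, feat_dim: int, topic_dim: int, latent_dim: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim + topic_dim, 128)
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)

    def forward(self, x: torch.Tensor, topic: torch.Tensor):
        h = torch.relu(self.fc(torch.cat([x, topic], dim=-1)))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # KL divergence of N(mu, sigma^2) from the standard normal prior
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

# Usage: a batch of 64-d utterance features with 16-d topic vectors.
enc = TopicEnrichedEncoder(feat_dim=64, topic_dim=16, latent_dim=32)
z, kl = enc(torch.randn(8, 64), torch.randn(8, 16))
```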
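The re-weighting strategy can likewise be sketched. Assuming the depression labels are continuous severity scores such as PHQ-8 totals in [0, 24] (an assumption made here for illustration), one simple instantiation bins the scores and weights each sample's loss by the inverse frequency of its bin, so rare severe cases are not drowned out by common mild ones:

```python
import torch

def inverse_frequency_weights(labels: torch.Tensor, num_bins: int = 25,
                              eps: float = 1e-6) -> torch.Tensor:
    """Weight each sample by the inverse frequency of its label bin,
    normalized so the mean weight over the batch is 1."""
    bins = torch.clamp(labels.long(), 0, num_bins - 1)
    counts = torch.bincount(bins, minlength=num_bins).float()
    weights = 1.0 / (counts[bins] + eps)
    return weights / weights.mean()

def reweighted_mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Squared error with rare-label samples up-weighted."""
    return (inverse_frequency_weights(target) * (pred - target) ** 2).mean()

# Usage: simulated PHQ-8-style severity scores in [0, 24].
target = torch.randint(0, 25, (32,)).float()
loss = reweighted_mse(torch.rand(32) * 24, target)
```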
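Finally, the core idea of the multimodal factorization in the second contribution is to separate modality-private latents from a shared latent and reconstruct each modality from both, pushing the shared code to carry cross-modal information. The sketch below uses linear encoders and decoders with an MSE reconstruction loss purely for illustration; the names `FactorizedFusion`, `enc_s`, and so on are hypothetical, and the memory fusion network is omitted.

```python
import torch
import torch.nn as nn

class FactorizedFusion(nn.Module):
    """Factorize each modality into a private latent plus one shared
    latent; each modality is reconstructed from [private, shared]."""
    def __init__(self, dim_t: int, dim_v: int, d_priv: int, d_shared: int):
        super().__init__()
        self.enc_t = nn.Linear(dim_t, d_priv)            # text-private encoder
        self.enc_v = nn.Linear(dim_v, d_priv)            # vision-private encoder
        self.enc_s = nn.Linear(dim_t + dim_v, d_shared)  # shared encoder
        self.dec_t = nn.Linear(d_priv + d_shared, dim_t)
        self.dec_v = nn.Linear(d_priv + d_shared, dim_v)

    def forward(self, x_t: torch.Tensor, x_v: torch.Tensor):
        z_t, z_v = self.enc_t(x_t), self.enc_v(x_v)
        z_s = self.enc_s(torch.cat([x_t, x_v], dim=-1))
        rec_t = self.dec_t(torch.cat([z_t, z_s], dim=-1))
        rec_v = self.dec_v(torch.cat([z_v, z_s], dim=-1))
        rec_loss = ((rec_t - x_t) ** 2).mean() + ((rec_v - x_v) ** 2).mean()
        return z_s, rec_loss  # the shared code would feed a detection head

# Usage: fuse 768-d text features with 512-d visual features.
fusion = FactorizedFusion(dim_t=768, dim_v=512, d_priv=64, d_shared=64)
z_shared, rec = fusion(torch.randn(8, 768), torch.randn(8, 512))
```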