Emotion is one of the most important and complex mechanisms of human beings. Enabling machines to understand, and even express, human emotions has long been a major problem in artificial intelligence. Some mental illnesses manifest themselves in changes of human emotion, and depression is one of them. Depression is a disease that leaves sufferers in low spirits, feeling sluggish and despondent. It affects a person's health in terms of thought, behavior, and emotional perception; in severe cases it leads to self-harm or even suicide. Depression thus has a very serious impact on human health and has received increasing attention internationally. This paper aims to establish a multimodal collaborative classification model by means of sentiment analysis. The model analyzes the characteristic data of patients with depression and predicts whether a patient has a tendency toward depression. Traditional diagnosis of depression mostly relies on professional doctors using self-examination questionnaires and scales. The purpose of these questionnaires and scales, as well as of physicians' clinical diagnosis, is to understand the patient's usual psychological state, physical condition, and emotional changes, which is similar in nature to sentiment analysis. In affective computing, the methods used to identify emotional changes mostly analyze facial features, speech features, EEG signals, and galvanic skin response signals. This paper analyzes the characteristic data of patients with depression from the perspectives of facial, speech, and textual features. The multimodal collaborative classification model proposed in this study combines language text features, audio features, and facial feature data recorded during interviews with depression patients. Corresponding algorithms are used to process the three types of features. Finally, by combining the analysis results of the three types of
features, the predicted values of the patients' depression tendency are obtained. The model is mainly composed of the following three parts: (1) The audio part analyzes the audio recorded in the interviews. The audio features provided by the data set are extracted from the audio log files by the COVAREP algorithm; every 0.3334 s constitutes a timestamp, and the extracted audio features are recorded under each timestamp. According to the temporal nature of the audio features, a long short-term memory network (LSTM) is established. The data set is also split by gender, and the features are fed into the LSTM in timestamp order to obtain a prediction based on audio features. (2) The facial part analyzes the various facial characterization data recorded in the interviews. The original data set provides features such as manually coded facial action units (AUs), eye gaze (GAZE), head pose (POSE), and 3D facial features, extracted from the interview videos by professional software; again, every 0.3334 s constitutes a timestamp, and feature data is recorded under each timestamp. An LSTM is established based on the temporal characteristics of the data; the data set is split by gender, and the facial features are input into the LSTM in timestamp order for classification and prediction. (3) The text part analyzes the interview transcripts to obtain a prediction based on textual features. Finally, the three prediction results are analyzed and some similarities between them are found. The experiment combines the predicted results linearly, giving the equation f = w1*outf + w2*outt + w3*outc + w4, where w1, w2, w3, w4 are coefficients, outf is the facial-feature prediction, outt is the text-feature prediction, outc is the audio-feature prediction, and f is the final output. The training set is used to fit the coefficients so that the test-set results satisfy the equation. The accuracy of the final prediction on the test set is 86.73%. The model therefore has practical applicability.
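The fusion step described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the per-patient prediction values and labels below are hypothetical, and ordinary least squares is assumed as the method for fitting the coefficients w1..w4 on the training set.

```python
import numpy as np

# Hypothetical per-patient predictions from the three unimodal models
# (illustrative values; the study's real outputs come from its LSTM/text models).
outf = np.array([0.9, 0.2, 0.7, 0.1, 0.8])  # facial-feature predictions
outt = np.array([0.8, 0.3, 0.6, 0.2, 0.9])  # text-feature predictions
outc = np.array([0.7, 0.1, 0.8, 0.3, 0.7])  # audio-feature predictions
y    = np.array([1.0, 0.0, 1.0, 0.0, 1.0])  # training labels (depression tendency)

# Design matrix with a constant column so w4 acts as the intercept.
X = np.column_stack([outf, outt, outc, np.ones_like(outf)])

# Fit w1..w4 by least squares: f = w1*outf + w2*outt + w3*outc + w4
w, *_ = np.linalg.lstsq(X, y, rcond=None)

f = X @ w                          # fused predictions
labels = (f >= 0.5).astype(int)    # threshold for a binary decision
```

On held-out data, the same fitted coefficients would be applied to the three unimodal predictions and the fused value thresholded to decide depression tendency.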