| Among the vast amount of video data, emotional information is a particularly important kind of semantic information, as it directly reflects the emotional impact of video content on the audience. However, existing video emotion recognition algorithms still make insufficient use of visual and textual modality information, mainly in the following respects: (1) when analyzing emotional content from the visual modality, the relationship between the video's temporal and spatial information is often ignored; (2) when analyzing emotional content from the textual modality, all sentences are treated as equally important, with no attention paid to key semantic information; (3) the hidden correlation between visual and textual modality information in the video is not fully exploited, so the relationship between visual and textual information needs to be explored; (4) in AI interview scenarios, the emotional expression of interviewees is crucial to evaluating interview performance.

To address these issues, this paper proposes the following methods. (1) To fully exploit the relationship between temporal and spatial information in visual data, this paper proposes a video emotion recognition method that fuses temporal and spatial features, using a transformer encoder network to obtain a spatio-temporal feature vector for the video. Experimental results show that the spatio-temporal feature fusion network outperforms traditional networks on various performance metrics for video visual feature extraction. (2) To fully exploit the key semantic information in textual data, this paper proposes a Bi-LSTM-Attention network based on an attention mechanism, which extracts the more important semantic information from the text. Experimental results show that the Bi-LSTM-Attention network achieves better emotional content recognition than traditional methods. (3) To fully exploit the hidden correlation between visual and textual modality information, two multimodal fusion methods are evaluated experimentally. For feature-level fusion, this paper proposes a method based on low-rank bilinear pooling, which effectively reduces the number of parameters in the feature fusion process. For decision-level fusion, this paper proposes a method based on matrix factorization, which transforms the computation of the optimal weight matrix into the problem of finding the maximum accuracy in a five-dimensional space and can effectively allocate weights to different emotions. Experimental results show that both the feature-level and decision-level fusion methods proposed in this paper outperform single-modal video emotion recognition methods on various performance metrics, and that the decision-level fusion method performs better than the feature-level fusion method on the dataset used in this paper. (4) For the specific scenario of AI interviews, this paper applies the emotion recognition algorithms to AI interview video emotion recognition. The system is designed according to the high-cohesion, low-coupling principles of software engineering, adopting a front-end/back-end separated architecture in which the system's business logic is decoupled from its algorithm logic. Functional and performance tests show that the emotion recognition algorithm can be applied effectively within the system developed in this paper. |
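To make the spatio-temporal fusion idea of contribution (1) concrete, the following is a minimal sketch, assuming per-frame spatial features have already been extracted by a CNN backbone and a transformer encoder then models their temporal relations. The class name, feature dimensions, and mean pooling over time are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    """Illustrative sketch: fuse per-frame spatial features with a
    transformer encoder that models temporal relations across frames."""
    def __init__(self, feat_dim=512, n_heads=8, n_layers=2, n_classes=5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) spatial features,
        # e.g. produced by a CNN backbone applied to each sampled frame.
        fused = self.temporal_encoder(frame_feats)   # temporal self-attention
        clip_vec = fused.mean(dim=1)                 # pool over the time axis
        return self.classifier(clip_vec)
```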
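The Bi-LSTM-Attention network of contribution (2) can be sketched in the same hedged way, assuming word-level token inputs and a single learned attention score over the Bi-LSTM hidden states; all names, vocabulary size, and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    """Illustrative sketch: a Bi-LSTM whose hidden states are weighted by
    learned attention scores so that more important words contribute more
    to the final emotion prediction."""
    def __init__(self, vocab_size=30000, emb_dim=128, hidden=128, n_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        h, _ = self.bilstm(self.embed(token_ids))     # (batch, seq_len, 2*hidden)
        scores = torch.softmax(self.attn(h), dim=1)   # attention over time steps
        context = (scores * h).sum(dim=1)             # weighted sum of states
        return self.classifier(context)
```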
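For the feature-level fusion of contribution (3), a minimal sketch of low-rank bilinear pooling is given below, assuming one visual and one textual feature vector per video. Projecting both modalities into a shared low-rank space and taking an element-wise product replaces the full bilinear weight tensor, which is where the parameter reduction comes from; the dimensions and class name are assumptions.

```python
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    """Illustrative sketch of low-rank bilinear pooling: project each
    modality into a shared low-rank space, take an element-wise product,
    and classify, avoiding the full (visual_dim x text_dim) bilinear tensor."""
    def __init__(self, visual_dim=512, text_dim=256, rank=64, n_classes=5):
        super().__init__()
        self.proj_v = nn.Linear(visual_dim, rank)
        self.proj_t = nn.Linear(text_dim, rank)
        self.classifier = nn.Linear(rank, n_classes)

    def forward(self, v_feat, t_feat):
        # v_feat: (batch, visual_dim), t_feat: (batch, text_dim)
        joint = torch.tanh(self.proj_v(v_feat)) * torch.tanh(self.proj_t(t_feat))
        return self.classifier(joint)
```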
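For the decision-level fusion of contribution (3), the sketch below illustrates only the search step described above, i.e. finding, over a five-dimensional space of per-emotion weights, the weight vector that maximises validation accuracy when mixing the two modalities' class probabilities. The matrix-factorization formulation itself is not reproduced, and the function name, grid resolution, and mixing rule are assumptions.

```python
import itertools
import numpy as np

def search_decision_weights(p_visual, p_text, labels, steps=11):
    """Illustrative sketch of decision-level fusion: grid-search a per-class
    weight w in [0, 1] for each of the five emotions, fusing the visual and
    textual class probabilities and keeping the most accurate weight vector."""
    grid = np.linspace(0.0, 1.0, steps)
    best_w, best_acc = None, -1.0
    for w in itertools.product(grid, repeat=5):     # one weight per emotion
        w = np.asarray(w)
        fused = w * p_visual + (1.0 - w) * p_text   # (n_samples, 5) fused scores
        acc = (fused.argmax(axis=1) == labels).mean()
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```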