In the past decade, the rise of deep convolutional neural networks has driven significant progress in human-centered visual perception computing, particularly in facial expression recognition. This thesis addresses the problems of lightweight design and multimodal integration in facial expression recognition, and proposes several improvements to raise the accuracy and efficiency of both unimodal and multimodal recognition. The main contributions are as follows:

To improve the recognition accuracy of existing lightweight networks on unimodal static facial expression recognition, this thesis proposes a series of improvements based on the lightweight MobileNetV3-Small model that substantially raise recognition accuracy. Specifically, it applies different Bneck simplification strategies to reduce model parameters and improve resistance to overfitting; introduces an attention mechanism to enhance the model's sensitivity to facial expression features; constructs a deep-shallow feature fusion network to capture multi-scale expression information; and applies transfer learning to optimize the training strategy, accelerating network convergence while improving recognition accuracy. Experiments on a self-constructed mixed facial expression dataset show that the proposed schemes outperform the original model: the optimal simplification strategy removes 18% of the Bneck parameters and suppresses overfitting by 5%, the proposed CTAM-MobileNetV3s improves average recognition accuracy by 5.64%, and the deep-shallow feature fusion network improves recognition accuracy by a further 3.14%.

To address the complexity and insufficient compactness of bimodal facial expression recognition models, this thesis proposes FSANet, a bimodal emotion recognition model that fuses facial expressions and speech within the VAANet framework. CTAM-MobileNetV3s, extended with 3D convolutions, serves as the backbone feature extractor of the visual stream, and a coordinate attention mechanism replaces the original spatial attention mechanism. On the public emotion recognition datasets eNTERFACE'05 and RAVDESS, FSANet achieves accuracies 6.17% and 3.90% higher than VAANet, respectively, while its model size and parameter count are only 1/3 and 1/7 of VAANet's, significantly reducing model complexity.

Finally, the proposed models are applied to design and implement an expression recognition system for practical scenarios. The system comprises two core modules: static-image expression recognition, and bimodal emotion recognition that fuses facial expression and speech. The system provides strong support for emotion analysis in real-world scenarios.
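As background for the coordinate attention mechanism adopted in FSANet's visual stream, the following is a minimal NumPy sketch of a coordinate-attention block. It is an illustration of the general technique only, not the thesis's exact configuration: the weight matrices are random placeholders standing in for the learned 1x1 convolutions, and the reduction ratio and feature-map shape are assumed for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x, reduction=8, rng=None):
    """Sketch of coordinate attention on a feature map x of shape (C, H, W).

    Pools along each spatial axis separately, mixes channels through a
    shared bottleneck, then reweights x with two direction-aware
    attention maps. Weights are random stand-ins for learned 1x1 convs.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    c, h, w = x.shape
    mid = max(c // reduction, 1)

    # Direction-aware pooling: average over W -> (C, H), over H -> (C, W).
    pool_h = x.mean(axis=2)
    pool_w = x.mean(axis=1)

    # Shared channel-mixing bottleneck (stand-in for 1x1 conv + ReLU).
    w1 = rng.standard_normal((mid, c)) * 0.1
    y = np.concatenate([pool_h, pool_w], axis=1)   # (C, H + W)
    y = np.maximum(w1 @ y, 0.0)                    # (mid, H + W)

    # Split back into the two axes and expand to attention maps in (0, 1).
    y_h, y_w = y[:, :h], y[:, h:]
    w_h = rng.standard_normal((c, mid)) * 0.1
    w_w = rng.standard_normal((c, mid)) * 0.1
    a_h = sigmoid(w_h @ y_h)[:, :, None]   # (C, H, 1)
    a_w = sigmoid(w_w @ y_w)[:, None, :]   # (C, 1, W)

    # Reweight the input along both spatial directions.
    return x * a_h * a_w

feat = np.random.default_rng(1).standard_normal((16, 7, 7))
out = coordinate_attention(feat)
print(out.shape)  # (16, 7, 7)
```

Unlike plain spatial attention, which collapses both spatial axes into a single map, this block keeps positional information along height and width separately, which is the property that motivates its use in place of the original spatial attention.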