Facial expression, which carries human emotional information, is regarded as a universal language that transcends ethnic and cultural diversity. In-depth research on facial expression recognition helps us better understand human emotional states and psychological activities, and enables more intelligent interaction between humans and computers. Dynamic expression sequences contain rich temporal and spatial information and can effectively reflect how a facial expression evolves over time. Video-based expression recognition has therefore become a vital research direction for the next generation of human-computer interaction systems. Expression recognition is divided into discrete and continuous (dimensional) representation models, and both emotional models have great application value in many fields of social life. For example, the retail industry evaluates customers' preferences for goods by recognizing their basic expressions while shopping, and the online education industry applies continuous dimensional expression recognition to monitor students' states and refine course quality analysis. In this paper, discrete expression classification and continuous expression regression in video sequences are studied. The main work includes the following two aspects:

(1) LSTM networks are widely used for facial expression recognition in video sequences. In view of the limited representation ability of a single-layer LSTM and its limited generalization ability on complex problems, a hierarchical attention model is proposed. Hierarchical representations of the time-series data are learned by stacking LSTM layers, a self-attention mechanism is used to construct differentiated relationships between the layers, and a penalty term is constructed and combined with the loss function to further optimize network performance. Experiments on the CK+ and MMI datasets demonstrate that, owing to these well-constructed hierarchical features, each time step can select information from the feature level it finds most informative. Compared with an ordinary single-layer LSTM, the hierarchical attention model expresses the emotional information of video sequences more effectively.

(2) Auxiliary learning can improve the performance of a main task, but auxiliary tasks usually require manual annotation, which consumes considerable time and manpower. We propose a self-auxiliary learning method that takes continuous dimensional (arousal-valence) emotion estimation as the main task and discrete emotion classification as the auxiliary task. The approach trains two neural networks: a label-generation network that adaptively creates labels for the auxiliary task, and a multi-task network that trains the primary task alongside the auxiliary task so as to maximize main-task performance. The two networks interact through a form of meta-learning and continuously improve the model over iterations. Self-auxiliary learning avoids manually labeling the auxiliary task, so the main task can still benefit from auxiliary learning on datasets that do not provide labels for an additional task. Evaluation experiments on the RECOLA dataset verify the effectiveness of the proposed continuous dimensional emotion recognition algorithm.
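The hierarchical attention idea in (1) can be illustrated with a minimal sketch, assuming PyTorch; class names, dimensions, and the specific penalty (an attention-entropy term here) are illustrative assumptions, not the authors' code, and the paper's exact penalty formulation may differ. Per-frame features pass through stacked LSTM layers, and at each time step a self-attention weight over the layer outputs lets the model select the most useful feature level.

```python
# Minimal sketch of a hierarchical attention model over stacked LSTMs (assumed PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAttentionLSTM(nn.Module):          # illustrative name
    def __init__(self, feat_dim, hidden_dim, num_layers=3, num_classes=7):
        super().__init__()
        self.lstms = nn.ModuleList(
            [nn.LSTM(feat_dim if i == 0 else hidden_dim, hidden_dim, batch_first=True)
             for i in range(num_layers)]
        )
        self.attn = nn.Linear(hidden_dim, 1)          # scores each layer's output
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                             # x: (batch, time, feat_dim)
        layer_outputs = []
        h = x
        for lstm in self.lstms:                       # stacked LSTM hierarchy
            h, _ = lstm(h)                            # (batch, time, hidden_dim)
            layer_outputs.append(h)
        H = torch.stack(layer_outputs, dim=2)         # (batch, time, layers, hidden)
        scores = self.attn(H).squeeze(-1)             # (batch, time, layers)
        alpha = F.softmax(scores, dim=-1)             # attention over layers per step
        fused = (alpha.unsqueeze(-1) * H).sum(dim=2)  # (batch, time, hidden)
        logits = self.classifier(fused.mean(dim=1))   # pool over time, then classify
        # Penalty: discourage uniform attention so layers play differentiated roles
        # (one plausible choice; the paper's exact penalty term may differ).
        penalty = -(alpha * (alpha + 1e-8).log()).sum(-1).mean()
        return logits, penalty

# Training step: combine cross-entropy with the weighted penalty term.
model = HierarchicalAttentionLSTM(feat_dim=128, hidden_dim=64)
frames = torch.randn(8, 16, 128)                      # 8 clips, 16 frames each
labels = torch.randint(0, 7, (8,))
logits, penalty = model(frames)
loss = F.cross_entropy(logits, labels) + 0.1 * penalty
loss.backward()
```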
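The self-auxiliary learning loop in (2) can likewise be sketched under assumptions: the code below uses PyTorch's `torch.func.functional_call` (PyTorch 2.x) and a one-step virtual update as one common way to realize the meta-learning interaction; the network shapes, feature inputs, and hyperparameters are hypothetical, and the paper's exact formulation may differ. The label-generation network produces soft auxiliary (discrete-emotion) labels, and it is rewarded when training on those labels lowers the main-task (arousal-valence) loss after a virtual update of the multi-task network.

```python
# Minimal sketch of self-auxiliary learning via a one-step meta-gradient (assumed PyTorch >= 2.0).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

class MultiTaskNet(nn.Module):                         # illustrative architecture
    def __init__(self, feat_dim=128, num_aux_classes=7):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.main_head = nn.Linear(64, 2)              # arousal, valence
        self.aux_head = nn.Linear(64, num_aux_classes) # discrete emotions

    def forward(self, x):
        h = self.backbone(x)
        return self.main_head(h), self.aux_head(h)

label_gen = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 7))
net = MultiTaskNet()
opt_net = torch.optim.Adam(net.parameters(), lr=1e-3)
opt_gen = torch.optim.Adam(label_gen.parameters(), lr=1e-3)

def joint_loss(params, x, av_target, aux_soft):
    """Main-task regression loss plus auxiliary loss on generated soft labels."""
    main_pred, aux_pred = functional_call(net, params, (x,))
    return (F.mse_loss(main_pred, av_target)
            + F.kl_div(F.log_softmax(aux_pred, -1), aux_soft, reduction='batchmean'))

x = torch.randn(8, 128)                  # per-clip features (illustrative)
av = torch.randn(8, 2)                   # arousal-valence annotations
x_val, av_val = torch.randn(8, 128), torch.randn(8, 2)

# 1) Meta step: update the label generator so its labels help the main task.
aux_soft = F.softmax(label_gen(x), dim=-1)
params = dict(net.named_parameters())
grads = torch.autograd.grad(joint_loss(params, x, av, aux_soft),
                            list(params.values()), create_graph=True)
virtual = {k: v - 1e-3 * g for (k, v), g in zip(params.items(), grads)}
main_val, _ = functional_call(net, virtual, (x_val,))
meta_loss = F.mse_loss(main_val, av_val)  # main-task loss after the virtual update
opt_gen.zero_grad(); meta_loss.backward(); opt_gen.step()

# 2) Ordinary step: train the multi-task network with the regenerated labels.
aux_soft = F.softmax(label_gen(x), dim=-1).detach()
main_pred, aux_pred = net(x)
loss = (F.mse_loss(main_pred, av)
        + F.kl_div(F.log_softmax(aux_pred, -1), aux_soft, reduction='batchmean'))
opt_net.zero_grad(); loss.backward(); opt_net.step()
```

Alternating these two steps lets the label generator and the multi-task network optimize each other over iterations, without any manual annotation of the auxiliary task.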