| Facial landmark localization and head pose estimation are important components of face analysis and processing and have long been a research hotspot. Algorithms for both tasks have a wide range of applications, including behavior analysis, action recognition, human-computer interaction, social interaction analysis, virtual/augmented reality, and gaze perception. Traditional algorithms in this field struggle to achieve satisfactory performance in natural scenes with complex backgrounds, large poses, exaggerated expressions, heavy occlusion, or extreme lighting. In comparison, deep learning algorithms based on convolutional neural networks usually perform very well in complex environments. However, existing algorithms still leave considerable room for improvement in the form of supervision, dataset utilization, and the overall efficiency of the algorithm pipeline. This paper studies facial landmark detection and head pose estimation algorithms in depth and addresses the problems of existing methods through a series of model design improvements. Specifically: 1. The two existing families of facial landmark detection algorithms, based on heatmap regression and on coordinate vector regression, are weak in constructing face priors and in regression accuracy. Few algorithms use heatmap supervision and coordinate vector supervision together during network training, and there has been little innovation in the form of the supervision constraints. To address this problem, this paper designs an explicit attention mechanism and uses it to build a facial landmark detection network that jointly exploits the heatmap and coordinate vector representations, by applying heatmap supervision to the explicit attention map and effectively merging the attention map with the shallow features of the backbone regression 
network. The network can effectively suppress background and texture responses in the input image that are unrelated to face structure and focus on image features strongly related to facial landmarks. This paper also proposes a dynamic loss balancing strategy to further improve model performance. 2. Existing facial landmark datasets are nearly an order of magnitude smaller than the datasets of other basic vision tasks, which further magnifies the impact of uneven data distribution on model training. At the same time, the annotation protocols of many existing datasets are inconsistent, which makes it difficult for researchers to train a model on multiple datasets at once, so dataset utilization is very low. To address this problem, this paper proposes a new batch normalization module, called the Separable Batch Normalization layer, which dynamically generates adaptive mapping parameters according to the input features. The module embeds well into existing high-performing network architectures and improves the performance of neural networks, especially lightweight ones, at a very small additional computational cost. This paper also proposes a cross-protocol training strategy that builds a mixed dataset from different datasets for network training, which further verifies the applicability of the proposed module and improves the utilization of existing datasets. 3. Existing head pose estimation algorithms often rely on a preceding face detection step. This step-by-step pipeline design not only lengthens the inference process and increases inference time, but also implicitly imposes the prior bias of the face detection box on the pose estimation algorithm, which makes subsequent algorithm upgrades more difficult and reduces the overall operation and maintenance efficiency of the algorithm. To address this problem, this paper constructs a face 6DoF 
pose estimation model, which performs face detection and head pose estimation simultaneously by regressing the 6DoF vector of the face. Compared with existing pipeline designs, the proposed algorithm is considerably more efficient. Furthermore, this paper improves the practicality of the model through a lightweight design. In terms of theory, the research results of this paper enrich the forms of supervision for landmark regression and remove the constraint of a single supervision setting. The study of applicability in deep learning training scenarios also deepens the understanding of the mechanism of the batch normalization layer. In terms of application, the single-model head pose regression method constructed in this paper greatly improves workflow efficiency and offers useful guidance for head pose estimation methods in a variety of practical application scenarios. |