| With the increasing development of technology,3D gaze estimation has attracted more and more research attention due to its potential use in various applications.Through gaze estimation technology,it is possible to infer the content of human interest,and then study their psychological activities.In terms of human-computer interaction,gaze estimation technology can allow patients who cannot speak with their hand organs not suitable for movement to integrate into people’s daily communication.In terms of assisted driving of automobiles,the sight estimation task can also be applied to automobiles to detect whether the driver is out of vigilance and prevent traffic accidents caused by fatigue driving or distraction.At present,the study of gaze estimation has achieved great development,but in the face of estimating the3 D gaze direction from 2D eye images,there will be great uncertainty,which poses a huge challenge to the existing deterministic methods.In recent years,Convolutional Neural Network(CNN)has become a hot topic in the field of computer vision research,and its excellent feature extraction and generalization capabilities have been widely used in various computer vision tasks.Similarly,CNN can effectively solve the key problem of accurately estimating the line of sight in a complex external environment.However,the current line-of-sight estimation network often adopts multi-column or multi-scale neural network structure design.This type of design has a series of problems such as parameter redundancy,difficulty in training,and gradient disappearance or explosion,which causes the performance of the final model to be severely restricted.The accuracy of line-of-sight estimation is greatly reduced.To address this series of challenges,this paper designs a line-of-sight estimation network(CGE)under the framework of variational inference.The traditional CNN network training method is to represent the input eye image as x and consider the regression of the line-of-sight estimation value g.The process of conventional line-of-sight estimation may be complicated.Therefore,unlike the traditional CNN network for end-to-end training on the sight line estimation task,we divide the entire training session f into two steps.Our approach assumes that it is possible to learn an intermediate image representation m of the eye,that is,we divide the model into two parts: j and k.The complexity of the network learning j and k should be significantly lower than the complexity of directly learning f,so that we can choose some significantly lower complexity neural network architectures with higher or equivalent performance for the same task of line-of-sight estimation.In process j,we apply the method of variational inference,the purpose of which is to generate gaze representations(gazemaps).During the training phase,the encoder in the posterior network takes as input the ground truth values ??of eye images and corresponding gazemaps and transforms them into a latent space with many gaussian distributed variables.Multiple random variables are subsequently sampled in the latent space and reconstructed into corresponding gazemaps by the decoder,which also serves as an intermediate supervision for the entire network.In process k,we will input the gazemaps generated in the previous process j into the regression network to estimate the gaze direction.Our regression network is based on the network structure of Dense Net.In order to verify the effectiveness of the network,we have also conducted comparative experiments on effectiveness analysis on some other classic network frameworks.Finally,we evaluate our model on three benchmark datasets(MPIIGaze,EYEDIAP,and Columbia),and experiments show that CGE can accurately estimate the 3D gaze direction. |