| With the development of "Internet+" education, the online classroom has broken through the limitations of time and space, but it also has shortcomings, such as the teacher’s inability to observe students’ cognitive state face to face. Knowing students’ attention levels in online classrooms is therefore a key issue. The purpose of this thesis is to use deep learning to assess students’ attention levels in online classrooms, covering face detection, expression recognition, and posture estimation, and to validate and analyze experiments on attention state determination based on these models. The thesis makes three main contributions. First, an attention state detection model based on expression recognition is investigated. The theory related to face detection and expression recognition is studied, and MTCNN (Multi-task Cascaded Convolutional Networks) is used for face detection; on the validation sets of the face datasets WIDER FACE and CelebA (CelebFaces Attributes), at an IoU (Intersection over Union) of 0.6, it achieves 91.8% accuracy and 80.8% recall. Four video classification algorithms and a model proposed in this thesis are then implemented. After analysis and comparison, the Transformer-based video classification network proposed in this thesis is selected as the attention state classification framework; on the validation set of the fatigue-driving dataset YawDD (Yawning Detection Dataset), it reaches a video classification accuracy of 87.93% and a recall of 84.26%, which are 4 and 3 percentage points higher, respectively, than the second-best framework, which is based on LSTM (Long Short-Term Memory). Second, an attention state detection model based on pose estimation is investigated. The theory related to pose estimation and attention mechanisms is studied, and spatial and channel attention mechanisms are introduced into the high-resolution multi-scale parallel network to
improve the accuracy of the HRNet (High-Resolution Net) network, and the model’s accuracy is evaluated on the MS COCO (Common Objects in Context) dataset. Then, based on the extracted key point information of students’ upper limbs and heads, five video classification frameworks are analyzed and compared, and the Transformer-based framework proposed in this thesis is selected; it achieves an accuracy of 85.14% and a recall of 83.98% on the validation set of the self-collected student online classroom dataset, which are 4 and 3 percentage points higher, respectively, than the second-best LSTM-based model. Finally, an experiment is designed to validate the model using multidimensional features of expressions and postures. Students’ attention states in the online classroom are classified as leaving, distracted, and normal; expression recognition and posture estimation are combined to determine the attention state, and the image acquisition rate is adjusted to improve the model’s real-time performance. The model is validated on a small scale on a self-recorded dataset of students in online classrooms. The experimental results show that the model classifies students’ attention states in the online classroom with an accuracy of 83.6% and a recall of 81.7%, which can better meet the requirements of online teaching. |
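
The face detection figures above are reported at an IoU of 0.6. As a minimal sketch of how such a criterion is commonly applied, the IoU between a predicted and a ground-truth box could be computed as below. The abstract does not state the box format or how 0.6 is used, so the sketch assumes axis-aligned boxes given as (x1, y1, x2, y2) and treats 0.6 as the match threshold. The function names `iou` and `is_match` are illustrative and are not taken from the thesis.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Degenerate (zero-area) boxes yield an IoU of 0 rather than a division error.
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.6):
    """Assumed criterion: a detection counts as correct if IoU >= threshold."""
    return iou(pred_box, gt_box) >= threshold
```

Under this assumed criterion, a predicted face box is counted as a true positive only when it overlaps a ground-truth box sufficiently, so a higher threshold makes the reported accuracy and recall stricter.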