With the rapid growth of video content and the success of deep learning, understanding human actions in video has become a highly active and challenging area in computer vision, especially for action recognition and temporal action detection. Due to the complexity of human actions, these two tasks are generally regarded as high-level problems in video understanding. Most existing high-level action recognition methods fail to exploit detailed, fine-grained, mid-level semantic information, and the uncertainty of boundary proposals makes accurate localization difficult for existing temporal action detection methods. Video action recognition and temporal action detection therefore require more fine-grained and accurately annotated datasets. Existing video datasets ignore the mid-level understanding of body parts, and their coarse instances with uncertain boundaries interfere with proposal generation and action prediction. This paper therefore constructs two datasets that are more fine-grained and accurate in the spatial and temporal dimensions.

To further deepen the understanding of actions, this paper investigates interpretable action recognition in video by explicitly disentangling human actions into the spatio-temporal composition of body parts and interacting objects. Specifically, a large-scale ExplainAction benchmark is built for this study, providing 9.5 million frame-level annotations of 10 body parts, 8.7 million gestures, and 230 interacting objects; it offers new opportunities to understand human actions by learning the body-part components of videos. With ExplainAction, a compositional and interpretable approach can be further exploited to improve action recognition performance.

On the other hand, this paper develops RefineAction, a new large-scale refined video dataset collected from existing video datasets and web videos. Specifically, RefineAction contains 139K refined action instances, densely annotated in nearly 17,000 untrimmed videos across 106 action categories. Compared with existing action localization datasets, RefineAction has finer action category definitions and high-quality annotations that reduce boundary uncertainty. Experimental results show that the overlapping instances and diverse durations of RefineAction pose new challenges for temporal action detection.