Video action understanding is one of the most important and challenging problems in computer vision. This thesis considers two specific tasks, namely action recognition and temporal action detection. Recent prevailing solutions directly learn the mapping from input data to output targets through a deep neural network. In practice, this data-driven approach often suffers from overfitting when sufficient training data is not available. In this thesis, we explore several regularization-based loss function designs that introduce specific "prior knowledge" to alleviate overfitting.

For action recognition, one challenge is that action classes are often defined at mixed granularity, i.e., the differences between classes are uneven, and some actions are harder to distinguish than others. We propose a two-branch network that goes from universal to specific: the universal branch learns general characteristics that separate most action categories, while the specific branch focuses on characteristics that distinguish particular confusing categories. Between the two branches, a category regularization block takes the output of the universal branch as input and learns category-specific masks that regularize the specific branch, so that class-dependent subtle differences are captured. Experimental results on three public datasets demonstrate the effectiveness of combining universal and specific features for action recognition.

For offline temporal action detection, we first consider the problem of overfitting in the existing "bottom-up" detection framework when evaluating actionness, i.e., the action phases of starting, continuing, and ending. Because the different action phases are modeled independently and then combined to form action proposals, the potential causality and exclusion within and between action phases are ignored. We propose an action-phase regularization method that applies intra-phase and inter-phase constraints to the evaluation of action phases, so that high-quality action proposals can be generated from the predicted phases. Furthermore, we consider an "anchor-free" framework that represents an action by a temporal point rather than a temporal window. This representation is more flexible but requires each point feature to cover the entire action scope. We therefore design an actionness regularization that uses the predicted actionness as an attention mask to select action regions for point proposals. Experimental results show that introducing action-phase or actionness regularization significantly improves detection performance in both frameworks.

For application scenarios that require real-time action detection, we explore online temporal action detection over a video stream. The challenge is that an action must be detected as soon as it starts, even before it has completed. Since the training data contains the complete frame sequence of each action, we regard the frames following the current prediction point as a form of privileged information and propose a progressive privileged knowledge distillation framework in which an offline teacher model assists the training of the online student model. The privileged information acts as an implicit regularization during distillation. Notably, the teacher and student models differ mainly in their input data rather than their network architectures. To reduce the impact of this input difference, we propose a simple but effective method that applies the knowledge distillation loss to partially hidden features of the student model and designs a curriculum learning procedure to gradually distill the privileged information. Compared with methods that explicitly predict future frames or features, our method avoids the prediction stage and achieves better performance. On two public datasets, the proposed distillation method effectively improves the detection accuracy of online models.
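The idea of distilling to partially hidden student features under a curriculum can be sketched as follows. This is a minimal illustration, not the thesis's exact formulation: the function names, the L2 form of the distillation loss, and the scheme of hiding a prefix of feature dimensions are all illustrative assumptions.

```python
import numpy as np

def masked_distillation_loss(student_feat, teacher_feat, visible_ratio):
    """Hypothetical sketch: apply an L2 distillation loss only to the
    visible (non-hidden) portion of the student's feature dimensions."""
    dim = student_feat.shape[-1]
    n_visible = max(1, int(dim * visible_ratio))
    mask = np.zeros(dim)
    mask[:n_visible] = 1.0  # hide the remaining dimensions from the loss
    diff = (student_feat - teacher_feat) * mask
    return float((diff ** 2).sum() / n_visible)

def curriculum_ratio(epoch, total_epochs):
    """Curriculum schedule (assumed linear): expose more feature
    dimensions to the distillation loss as training progresses."""
    return min(1.0, (epoch + 1) / total_epochs)
```

Early in training, only a small fraction of the student's features are matched against the teacher, so the online model is not forced to immediately mimic representations computed from privileged future frames; the pressure grows gradually as `curriculum_ratio` approaches 1.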