This thesis targets Vision-based Human Action Recognition (Vi-HAR), the task of recognizing the type of human action performed in visual data (e.g., skeleton sequences or RGB frames). Human action recognition is currently an active topic in computer vision, playing a major role in various modern applications, including, but not limited to, video surveillance, healthcare, human-computer interaction, and video retrieval. Interest in this area has grown with the advances in deep learning techniques and the availability of large human action datasets, and progress promises substantial benefits for safety and healthcare. From classical handcrafted methods to deep learning-based methods, various techniques have been proposed for effective and efficient action recognition. However, many open issues remain that keep the action recognition task far from solved. Among the most challenging are (i) the enormous variations in the visual and motion appearance of people and actions; (ii) the sensitivity of vision-based action recognition models to the environment and recording settings; and (iii) the high computational and memory requirements of the task.

In this thesis, to address these problems, we explore different ways to design effective and efficient human action recognition models. Since action recognition is a classification problem, our main concern is learning robust and discriminative representations for classification by effectively modeling the spatial and temporal features in visual data. To this end, we develop three Human Action Recognition (HAR) models based on deep neural networks combined with the attention mechanism, a complex cognitive function that has arguably become one of the most seminal concepts in the deep learning field. For comprehensiveness, the proposed models utilize different data modalities: two are skeleton-based, while the third is RGB-based.

In the first part of the thesis, we study, review, and introduce prominent state-of-the-art techniques. We categorize previous action recognition techniques by the data modality used in the models into skeleton-based and RGB-based approaches, and we introduce attention mechanisms in deep learning as well as temporal modeling in action recognition. On this basis, the contributions of this thesis fall into three main groups, each corresponding to one of the three proposed models, as described below.

First, we introduce an Enhanced Discriminative Graph Convolutional Network (ED-GCN) for skeleton-based action recognition by integrating a Squeeze-and-Excitation (SE) module into the GCN. Through this integration, the network can use global information to selectively enhance significant features and hence improve recognition accuracy (a minimal sketch of this idea follows). In addition, we propose an adaptive temporal modeling block (ATB), a sequential two-stage design consisting of a re-calibration stage and a motion-interaction stage, to model temporal motion features that complement the spatial modeling and further improve recognition accuracy.
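To make the first contribution concrete, the following is an illustrative sketch, not the exact ED-GCN implementation (whose details are given in the corresponding chapter), of how an SE-style recalibration can be attached to skeleton GCN features in PyTorch. The tensor layout (N, C, T, V) for batch, channels, frames, and joints, and the reduction ratio of 16, are assumptions of this sketch.

import torch
import torch.nn as nn

class SERecalibration(nn.Module):
    # Hypothetical sketch: squeeze-and-excitation over skeleton GCN features.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (N, C, T, V) feature map produced by a GCN layer
        n, c, t, v = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: global context per channel
        w = self.fc(s).view(n, c, 1, 1)   # excitation: per-channel weights in (0, 1)
        return x * w                      # selectively enhance significant channels

Because the weights are computed from globally pooled statistics, each layer can emphasize the channels that carry discriminative joint-motion patterns at negligible extra cost.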
Second, motivated by the observation that each feature channel focuses on a specific pattern, we propose our second recognition model, an enhanced graph convolutional network with an enriched graph topology representation for skeleton-based action recognition. In this model, we align graph learning with the channel level by introducing a graph convolution with an enriched topology based on attentive channel-wise correlations (ACC-GC). ACC-GC learns a shared graph topology over the different channels and augments it with attentive channel-wise correlations, giving the model the capability of learning channel-wise enriched topologies (an illustrative sketch follows this summary).

Third, we shift our view to RGB video frames as another data modality, since besides skeleton data, other modalities, such as RGB, depth, infrared, point cloud, event stream, and audio, can also represent human actions. Our third model addresses the HAR problem using the RGB modality. In detail, we propose a representation learning module, the Collaborative Positional-Motion Excitation module (CPME), to effectively and jointly capture appropriate channel-wise features and motion information for action recognition in videos. Moreover, we propose a simple yet effective 2D-CNN-based network, CPME-Net, to learn a discriminative video-level representation for action recognition. CPME is a plug-and-play module and can therefore be inserted into a wide range of 2D-CNN-based action recognition architectures (a sketch of the motion-excitation idea also follows this summary).

Each recognition model is explicated and analyzed in detail in a dedicated chapter. Finally, we show that the proposed models obtain better or comparable results with respect to the state of the art on various challenging, real-world human action recognition datasets (NTU-RGBD, Kinetics-Skeleton, Northwestern-UCLA, UCF101, and HMDB51).
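As an illustration of the second contribution, here is a minimal sketch of a graph convolution whose shared learnable topology is enriched with attentive channel-wise correlations. The embedding functions, the temporal pooling, and the tanh-based affinity are assumptions for this sketch rather than the exact ACC-GC formulation.

import torch
import torch.nn as nn

class ACCGraphConv(nn.Module):
    # Hypothetical sketch: shared topology + attentive channel-wise correlations.
    def __init__(self, in_channels, out_channels, num_joints):
        super().__init__()
        self.shared_A = nn.Parameter(torch.eye(num_joints))   # shared learnable topology
        self.theta = nn.Conv2d(in_channels, out_channels, 1)  # query embedding
        self.phi = nn.Conv2d(in_channels, out_channels, 1)    # key embedding
        self.g = nn.Conv2d(in_channels, out_channels, 1)      # feature transform

    def forward(self, x):
        # x: (N, C, T, V) skeleton features
        q = self.theta(x).mean(dim=2)  # (N, C', V): temporally pooled embeddings
        k = self.phi(x).mean(dim=2)    # (N, C', V)
        # attentive channel-wise correlations: one V x V affinity map per channel
        corr = torch.tanh(q.unsqueeze(-1) - k.unsqueeze(-2))  # (N, C', V, V)
        A = self.shared_A + corr       # channel-wise enriched topologies
        return torch.einsum('nctv,ncvw->nctw', self.g(x), A)

The key point is that the adjacency used for aggregation is no longer a single matrix shared by all channels: every channel aggregates joint features over its own correlation-refined topology.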
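For the third contribution, the following sketch illustrates only the motion-excitation half of a CPME-like module for 2D-CNN video features: adjacent-frame feature differences are pooled into per-channel attention weights. The reduction ratio, the padding of the last time step, and the residual formulation are assumptions; the collaborative positional branch and the full CPME-Net design are described in the corresponding chapter.

import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    # Hypothetical sketch: excite channels that respond to frame-to-frame motion.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.expand = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, x, num_frames):
        # x: (N*T, C, H, W), features of T sampled frames per video
        nt, c, h, w = x.shape
        z = self.reduce(x).view(nt // num_frames, num_frames, -1, h, w)
        diff = z[:, 1:] - z[:, :-1]                    # adjacent-frame motion cues
        diff = torch.cat([diff, diff[:, -1:]], dim=1)  # repeat last step to keep T
        m = diff.reshape(nt, -1, h, w).mean(dim=(2, 3), keepdim=True)  # spatial pool
        attn = torch.sigmoid(self.expand(m))           # per-channel motion attention
        return x + x * attn                            # residual channel excitation

Since the module only adds 1x1 convolutions and element-wise operations, it can be dropped between the blocks of a standard 2D-CNN backbone, which is what makes the plug-and-play usage described above possible.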