In the rapidly developing era of big data, video has become a critical carrier of information exchange. As an important research topic in computer vision, video-based human action understanding plays a significant role in smart homes, security surveillance, and other scenarios. Meanwhile, with the rise of depth-sensing devices and pose estimation algorithms, skeleton data filters out the noise of video backgrounds and subject appearance, enabling more precise extraction of action features for downstream tasks. However, one of the major challenges in action understanding is view variation: changes in camera position lead to different pose occlusions and information loss for the same action, degrading the network's ability to recognize and analyze it correctly, so the representations of action sequences from different views need to be unified. To address these problems, this thesis explores robust view-invariant action understanding techniques from three aspects: (1) knowledge transfer between views, (2) knowledge contrast between views, and (3) knowledge fusion from multiple views. A series of studies on skeleton-based view-invariant action recognition algorithms is carried out, and the difficulties and future research directions are discussed in depth. The overall work can be summarized as follows:

(1) This thesis proposes a view transformation network for view normalization, which guides an input action sequence from an arbitrary view to be converted into a unified base view, and conducts joint attention learning between the view-transformed and reference base-view action sequences to assist more accurate view transformation. Finally, a feature extraction network is employed to obtain the action features of the transformed view for classification. Knowledge transfer grounded in the base view and the co-attention mechanism together achieve the best recognition performance on three large-scale action datasets: UESTC, NTU 60, and NTU 120.

(2) This thesis proposes a spatio-temporal cross-view feature learning method for extracting view-common features. Built on graph-based feature fusion, a cross-view spatio-temporal graph is constructed, and corresponding graph and temporal convolutions are applied along the spatial and temporal dimensions, realizing information interaction between action sequences from different views and yielding a view-common representation. Detailed ablation studies and feature visualizations demonstrate the validity of each component.

(3) This thesis further proposes Fisher contrastive learning, which encourages action features from different views to be represented in the same semantic space. First, view disentanglement is performed to separate view-specific action features from view-common ones. Then, contrastive learning is combined with Fisher discrimination, forming view-term and semantic-term Fisher contrastive objectives and producing robust, view-invariant, action-discriminative features. Through graph-based view fusion, view disentanglement, and view contrastive learning, the model achieves the highest action classification accuracy on the same three large-scale action datasets.

Together, these three works yield a multi-view action recognition model based on the view transformation network and a cross-view Fisher contrastive action discrimination model based on spatio-temporal information. They focus on a common view and on common features, respectively, learning a unified representation of action sequences from different views to realize arbitrary-view action recognition.
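To make the co-attention in (1) concrete, below is a minimal PyTorch sketch of cross-attention between a view-transformed sequence and a base-view reference. The module name CoAttention, the (batch, frames, dim) feature shapes, and the residual fusion are illustrative assumptions, not the thesis's exact design.

import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Cross-attends a view-transformed sequence to a base-view reference.

    x_trans, x_ref: (batch, frames, dim) feature sequences. All names and
    the residual fusion are illustrative, not the thesis's exact design.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x_trans: torch.Tensor, x_ref: torch.Tensor) -> torch.Tensor:
        # Each transformed frame attends over all frames of the reference.
        attn = torch.softmax(
            self.query(x_trans) @ self.key(x_ref).transpose(1, 2) * self.scale,
            dim=-1,
        )
        # Aggregate reference features and fuse them back residually.
        return x_trans + attn @ self.value(x_ref)

Here the transformed sequence queries the base-view reference, so each frame can borrow base-view cues before the feature extraction network produces features for classification.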
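For the cross-view spatio-temporal graph in (2), the sketch below shows one block that performs spatial message passing over a stacked multi-view joint graph followed by a temporal convolution. The adjacency construction (joints of all views stacked, with extra edges linking the same joint across views) and all names are assumptions for illustration.

import torch
import torch.nn as nn

class CrossViewSTBlock(nn.Module):
    """One spatial-graph + temporal-convolution block over a cross-view graph.

    x: (batch, channels, frames, joints * views). A is a normalized adjacency
    over the joints of all views stacked together, with extra edges linking
    the same joint across views; this construction is an assumption.
    """
    def __init__(self, in_ch: int, out_ch: int, A: torch.Tensor, t_kernel: int = 9):
        super().__init__()
        self.register_buffer("A", A)            # (J*V, J*V) cross-view adjacency
        self.gcn = nn.Conv2d(in_ch, out_ch, 1)  # 1x1 conv = per-node linear map
        self.tcn = nn.Conv2d(out_ch, out_ch, (t_kernel, 1),
                             padding=(t_kernel // 2, 0))  # per-node temporal conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Spatial message passing across the joints of all views, then temporal mixing.
        x = torch.einsum("nctv,vw->nctw", self.gcn(x), self.A)
        return torch.relu(self.tcn(x))

Because corresponding joints of different views are linked in A, each graph convolution exchanges information between views, which is one plausible reading of the information interaction described above.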
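For (3), a hedged sketch of how a contrastive term might be combined with a Fisher-style discriminant term: the supervised-contrastive (InfoNCE-style) loss and the trace-ratio scatter term below are stand-ins for the thesis's view-term and semantic-term objectives, and all names and weights (tau, lam) are assumptions.

import torch
import torch.nn.functional as F

def fisher_contrastive_loss(z: torch.Tensor, labels: torch.Tensor,
                            tau: float = 0.1, lam: float = 0.1) -> torch.Tensor:
    """Illustrative combination of a supervised-contrastive term with a
    Fisher-style discriminant term; not the thesis's exact formulation.

    z: (N, D) features pooled from multiple views; labels: (N,) action ids,
    shared by all views of the same action instance.
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                                  # pairwise similarities
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    # Contrastive term: pull same-action features (from any view) together.
    logits = sim.masked_fill(eye, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    con = -(log_prob.masked_fill(~pos, 0.0)).sum(1) / pos.sum(1).clamp(min=1)
    # Fisher term: within-class scatter over between-class scatter (traces).
    mu = z.mean(0)
    s_w = z.new_zeros(())
    s_b = z.new_zeros(())
    for c in labels.unique():
        zc = z[labels == c]
        mu_c = zc.mean(0)
        s_w = s_w + ((zc - mu_c) ** 2).sum()
        s_b = s_b + len(zc) * ((mu_c - mu) ** 2).sum()
    return con.mean() + lam * s_w / (s_b + 1e-8)

Features of the same action captured from different views share a label, so the contrastive term pulls cross-view pairs together while the Fisher term shrinks within-class scatter relative to between-class scatter, matching the stated goal of view-invariant yet action-discriminative features.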