Group activity recognition is a crucial and challenging problem that has recently attracted increasing attention in the field of video understanding. It aims to recognize the overall activity in multi-person scenes and has promising applications such as sports video analysis, security systems, social behavior understanding, and video search and retrieval. The core of group activity recognition lies not only in recognizing the actions of individual actors, but also in fully exploiting scene information and the interactions among individuals. However, some previous methods reason only over individual actor features and neglect to model scene information, which often carries clues for inferring the group's activity. When modeling the spatial and temporal relationships between individual actors, previous approaches either capture the spatial and temporal relationships separately or aggregate individual features directly into group representations, which makes it difficult to optimize the model jointly in the spatial and temporal dimensions and yields group features that lack semantic richness and relevance.

To address these problems, this paper proposes a group activity recognition model based on cross-time-step dynamic graphs; this branch takes the appearance features of individuals as input. Specifically, scene context information is first encoded into the individual appearance features by a Transformer-based scene encoding module (sketched below), so that each individual's appearance features are associated with, and enhanced by, the scene they belong to. The interactions among the scene-encoded individual features are then explored further: all individuals in different frames are connected into a spatial-temporal graph across time steps, and their spatial-temporal relationships are learned by a dynamic graph convolutional network (see the second sketch below). Finally, these relation-aware individual features are globally pooled to obtain the group features. On two widely used datasets, the Volleyball Dataset and the Collective Activity Dataset, the proposed method achieves results competitive with state-of-the-art methods, and experimental results demonstrate the effectiveness of each proposed module.

The pose features of individual actors usually play a guiding role in learning group activity: they not only facilitate accurate recognition of individual actions but also contain key cues for the group activity. This paper therefore also proposes a group activity recognition model that mixes in pose features, aiming to fully learn the spatial-temporal relationships in pose features and obtain informative group representations for activity prediction. First, a pose estimation backbone network predicts the key-point coordinates of all individuals annotated in the dataset, and these coordinates are transformed into pose features by linear projection. As in the appearance branch, scene context information is encoded into the individual pose features, and the cross-time dynamic graph convolution module is then used to reason about their spatial-temporal relationships. The pose-based branch and the appearance-based branch each learn the scene context and reason about spatial-temporal interactions, and the two branches are fused for the final group activity prediction (a fusion sketch follows). Extensive experiments verify the effectiveness and rationality of each branch, and the method achieves results that compete with state-of-the-art models on both datasets while improving accuracy over either single branch.
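To make the pipeline concrete, the following is a minimal PyTorch sketch of the Transformer-based scene encoding step, assuming a learnable scene token that is attended to jointly with the per-actor appearance features. The class name, feature dimension, and single-layer design are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    """Sketch: encode scene context into actor appearance features via
    self-attention over all actors plus a learnable scene token."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.scene_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, actors):               # actors: (B, N, dim)
        tok = self.scene_token.expand(actors.size(0), -1, -1)
        x = torch.cat([tok, actors], dim=1)  # prepend scene token
        x = self.encoder(x)                  # actors attend to scene and each other
        return x[:, 1:]                      # scene-enhanced actor features
```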
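Likewise, a minimal sketch of one cross-time-step dynamic graph convolution step, assuming the adjacency matrix is computed on the fly from pairwise feature similarity over all T*N actor nodes spanning the T frames; the projection names (theta, phi) and the residual update are illustrative choices rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossTimeDynamicGCN(nn.Module):
    """Sketch: one dynamic graph-convolution step over a spatial-temporal
    graph whose nodes are all actors in all T frames (T*N nodes)."""

    def __init__(self, dim=256):
        super().__init__()
        self.theta = nn.Linear(dim, dim)  # similarity projection (query-like)
        self.phi = nn.Linear(dim, dim)    # similarity projection (key-like)
        self.gconv = nn.Linear(dim, dim)  # graph-convolution weights

    def forward(self, x):                 # x: (B, T*N, dim)
        # Dynamic adjacency: pairwise similarity across all time steps,
        # normalized per node with a scaled softmax.
        sim = torch.einsum('bid,bjd->bij', self.theta(x), self.phi(x))
        adj = F.softmax(sim / x.size(-1) ** 0.5, dim=-1)
        # Propagate features along the learned edges (residual update).
        return F.relu(x + self.gconv(torch.bmm(adj, x)))
```

Global average pooling over the T*N relation-aware node features would then yield the group representation used for classification, matching the abstract's pooling step.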
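Finally, a hedged sketch of the two-branch design: predicted keypoints (assumed here to be 17 COCO-style joints) are linearly projected into pose features, each branch is pooled into a group feature, and the branch predictions are fused by simple averaging. The joint count, class count (8, as in the Volleyball Dataset), and averaging fusion are assumptions; the paper's actual fusion scheme may differ.

```python
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    """Sketch: project pose keypoints into features, pool each branch
    into a group representation, and fuse branch logits by averaging."""

    def __init__(self, dim=256, n_joints=17, n_classes=8):
        super().__init__()
        self.pose_proj = nn.Linear(n_joints * 2, dim)  # (x, y) per joint
        self.cls_app = nn.Linear(dim, n_classes)       # appearance head
        self.cls_pose = nn.Linear(dim, n_classes)      # pose head

    def forward(self, app_feats, keypoints):
        # app_feats: (B, T*N, dim); keypoints: (B, T*N, n_joints, 2)
        pose_feats = self.pose_proj(keypoints.flatten(2))
        g_app = app_feats.mean(dim=1)   # global pooling -> group feature
        g_pose = pose_feats.mean(dim=1)
        # Late fusion of the two branch logits.
        return (self.cls_app(g_app) + self.cls_pose(g_pose)) / 2
```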