| Group behavior recognition has become a hot topic in computer vision and artificial intelligence, and it is valuable for sports event analysis, abnormal behavior detection and early warning, and video classification of real-time crowd scenes. It is also a challenging task that faces two main difficulties: the first is how to construct the interaction relationships between group members; the second is how to optimize discriminative spatial-temporal features from the time sequence in order to construct simple behavioral descriptors. Most existing methods use spatial-temporal fully connected network architectures, which can overload the network, hinder the training of the group behavior recognition algorithm, and in turn reduce recognition accuracy. In this paper, two frameworks are proposed to address these two difficulties: a framework based on simple modeling of interaction relationships and a framework based on GAT-Transformer. The main research content and contributions are as follows: (1) A framework for group behavior recognition based on simple modeling of interaction relationships. For the first difficulty, the evolution of the interaction relationships between group members is described by the iterative update of the node-connected graph in a Graph Convolutional Network (GCN), where each node member is described by appearance, position, and trajectory characteristics. The GCN aggregates the member information of all nodes according to the interaction matrix, node feature set, and shared weight matrix calculated from the descriptors of all node members, and the nodes that aggregate more information are called key nodes. The interaction relationships constructed only from these sparse key nodes are the simple interaction relationships. For the second difficulty, this framework proposes the Intersection Similarity Coefficient (ISC), which is calculated between the individual behavior attributes and the group behavior categories
pre-classified at the current frame. The ISC is used as the temporal weight of the current frame to further simplify the above interaction relationships, so as to build a strongly discriminative spatial-temporal feature descriptor for the whole video. Finally, the condensed video descriptor driven by key members and key frames is fed into softmax to recognize group and individual behavior. This algorithm achieves average recognition rates of 93.6% and 93.8% on the CAD (Collective Activity Dataset) and Volleyball datasets, respectively, and its effectiveness is verified by comparison with other algorithms. (2) A framework for group behavior recognition based on GAT-Transformer. GCN determines key members by fusing node member information according to the characteristics of the interaction relationships between node members; its disadvantage is that all interaction information (such as appearance and location) is treated equally, so it cannot integrate the interaction relationships between node members into the model well. It is therefore proposed that the evolution of the interaction relationships between group members be described by the iterative update of the node-connected graph in a Graph Attention Network (GAT), where each node member is described by pose, appearance, and position characteristics. The built-in attention mechanism in GAT distinguishes nodes by assigning them unequal weight coefficients through attention iterations, and the nodes with larger attention coefficients aggregate more information and are called key nodes. The interaction relationships constructed only from the key nodes thus become simple. In addition, the intersection similarity coefficients require additionally designed sub-networks to obtain the member behavior attribute scores, which increases the complexity of the network. It is therefore proposed to use two parallel temporal encoders in the improved Transformer to encode the individual behavior features and the above-mentioned parsimonious interaction features
respectively, and then match the two through the spatiotemporal decoder. The encoded individual behavior feature is regarded as the "Key", and the encoded parsimonious group interaction feature is regarded as the "Query". The weights of different frames are compared by calculating the correlation between the two according to the self-attention and multi-head attention mechanisms, so as to find the key temporal segments. By combining GAT and Transformer for the first time, the algorithm achieves a breakthrough in group behavior recognition accuracy, reaching average recognition rates of 95.5% and 96.1% on the Volleyball dataset and the CAD (Collective Activity Dataset), respectively. |
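
The Query/Key matching described above can be illustrated with a minimal sketch. It is not the implementation from this work: it assumes one encoded group interaction vector (Query) and one encoded individual behavior vector (Key) per frame, scores each frame by the scaled dot product of the two, and normalizes the scores with softmax to obtain frame weights. The function names `frame_weights` and `key_frames`, the feature layout, and the toy numbers are all hypothetical, and the learned projections and multiple heads of a real Transformer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_weights(queries, keys):
    """Weight each frame by the scaled dot product of its Query and Key.

    queries[t]: encoded group interaction feature of frame t (assumed layout)
    keys[t]:    encoded individual behavior feature of frame t (assumed layout)
    """
    if len(queries) != len(keys):
        raise ValueError("queries and keys must cover the same frames")
    d = len(queries[0])
    scores = [
        sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d)
        for q, k in zip(queries, keys)
    ]
    return softmax(scores)


def key_frames(weights, top_k):
    """Return the indices of the top_k highest-weighted frames in temporal order."""
    ranked = sorted(range(len(weights)), key=lambda t: weights[t], reverse=True)
    return sorted(ranked[:top_k])


# Toy example: 4 frames with 3-dimensional features (made-up numbers).
queries = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
keys = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]

weights = frame_weights(queries, keys)
selected = key_frames(weights, top_k=2)
```

In this toy input, the frames whose Query and Key agree most strongly receive the largest weights, which is the sense in which the correlation between the two streams picks out key temporal segments. A full model would compute attention across all pairs of frames with learned projection matrices rather than one score per frame.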