In recent years, with the continuous development of artificial intelligence technology and the strong social demand for security surveillance, research on behavior recognition has steadily grown. Group behavior recognition aims to infer the collective activity performed by a group of people in a given scene. Recognizing group behavior requires understanding individual behavior characteristics and scene features, as well as modeling the interaction relationships within the group. Research on this topic holds significant academic value in computer vision and has substantial practical applications, such as sports video analysis, video search and retrieval, intelligent visual robots, and abnormal behavior monitoring in surveillance.

In this paper, we focus on two key challenges in building an effective recognition algorithm: effectively integrating the multi-modal features of individual participants, and optimizing the modeling and inference of the interaction relationships among participants. The main research contents and contributions are as follows.

First, we propose an algorithmic framework called Adaptive multi-modal Fusion and Implicit Relationship Learning (AFRL). It addresses the significant information loss that can occur when heterogeneous, feature-rich member representations are combined solely through cascaded concatenation. To this end, we design a dual-stream adaptive multi-modal fusion module. The module concatenates pose features and optical-flow features into a single representation, compresses it into a salient latent vector, and uses that vector to guide the iterative calculation of a loss between the latent vector and the original feature information. In this way, salient features are strengthened and compact member modeling is achieved.

In the interaction relationship learning phase, AFRL introduces an implicit interaction modeling module to address the inaccurate description of long-range participant dependencies. The module applies the self-attention mechanism of a Transformer encoder to learn interactions between group members, computing the appearance similarity of paired feature vectors as association strengths. These similarity scores selectively extract the person-level information that matters for behavior recognition and capture participants' spatial structure, allowing dependencies among group members to be modeled and inferred while appearance and position relationships are represented implicitly, without relying on any a priori spatial or temporal structure. The framework achieves average recognition accuracies of 92.4% and 93.7% on the publicly available Collective Activity Dataset (CAD) and Volleyball Dataset (VD), respectively, demonstrating its effectiveness.

Second, we propose an algorithmic framework based on Selective Feature Fusion and Dynamic Relational Inference (SFDRI). Instead of reusing AFRL's compression-and-reconstruction adaptive fusion for the multi-modal fusion problem, this algorithm incorporates a selective feature fusion module, which draws probability-distribution scores from a random function and selects the most relevant feature representation by resampling across the different modal features.
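To make the fusion step concrete, the following is a minimal PyTorch sketch of AFRL's dual-stream adaptive fusion module, not the thesis code: it assumes per-person pose and optical-flow feature vectors, and it approximates the iterative loss calculation between the latent vector and the original features with a single autoencoder-style reconstruction loss; all class names, dimensions, and parameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamAdaptiveFusion(nn.Module):
    """Concatenate pose and optical-flow features per person, compress them
    into a salient latent vector, and measure how much of the original
    information the latent vector retains via a reconstruction loss."""
    def __init__(self, pose_dim: int, flow_dim: int, latent_dim: int):
        super().__init__()
        in_dim = pose_dim + flow_dim
        self.encoder = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, in_dim)

    def forward(self, pose, flow):
        x = torch.cat([pose, flow], dim=-1)           # fuse the two streams
        z = self.encoder(x)                           # compressed salient latent vector
        recon_loss = F.mse_loss(self.decoder(z), x)   # latent vs. original features
        return z, recon_loss

# Illustrative usage: a batch of 8 clips, 12 players, 128-d features per stream.
pose = torch.randn(8, 12, 128)
flow = torch.randn(8, 12, 128)
fusion = DualStreamAdaptiveFusion(128, 128, 96)
z, aux_loss = fusion(pose, flow)   # aux_loss is added to the training objective
```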
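The selective feature fusion module of SFDRI can be sketched in the same spirit. Below, the random probability-distribution scoring and resampling over modalities is rendered with a Gumbel-softmax draw, a common differentiable stand-in that is an assumption here rather than the thesis's exact operator; names and dimensions are again illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveFeatureFusion(nn.Module):
    """Score each modality with a stochastic probability distribution and
    resample a fused representation weighted toward the most relevant one."""
    def __init__(self, dim: int, n_modalities: int):
        super().__init__()
        self.score = nn.Linear(dim * n_modalities, n_modalities)

    def forward(self, feats):                    # list of [B, N, dim] tensors
        stacked = torch.stack(feats, dim=-1)     # [B, N, dim, M]
        logits = self.score(torch.cat(feats, dim=-1))
        # Gumbel-softmax: a differentiable random draw over modality scores,
        # standing in for the probability-based resampling described above.
        w = F.gumbel_softmax(logits, tau=1.0, hard=False)   # [B, N, M]
        return (stacked * w.unsqueeze(-2)).sum(dim=-1)      # [B, N, dim]

# Illustrative usage with two modalities of 128-d per-person features.
pose, flow = torch.randn(8, 12, 128), torch.randn(8, 12, 128)
fused = SelectiveFeatureFusion(128, 2)([pose, flow])
```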
For modeling the interaction relationships among members, we optimize the participant relationship inference module and design a dynamic relationship inference module, targeting the problem of how to model the spatio-temporal context of specific members and capture long-distance structural relationships. Interaction regions are initialized on a spatio-temporal graph, and dynamic offsets for each person's features are added to form individual-specific interaction graphs for information transfer, while embedded dot products compute the interaction strengths between members. This learned sampling allows the network to collect long-range context efficiently by selecting only a subset of the most relevant nodes in the spatio-temporal graph. Finally, the group behavior is recognized by iteratively updating the features on the interaction graph. The method alleviates the earlier problem that interactions established on predefined graphs do not suit all behavioral data, and it dynamically learns global information from local interaction regions. Compared with state-of-the-art group behavior recognition models, the algorithm achieves better recognition performance on the public Volleyball and Collective Activity datasets, demonstrating the effectiveness and potential of this research.
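As an illustration of the dynamic relationship inference step, the sketch below performs one feature-update pass on the spatio-temporal graph: interaction strengths come from an embedded dot product, and the individual-specific sampling of relevant nodes is approximated with a hard top-k selection rather than the offset-based interaction regions described above; everything here is a hypothetical simplification.

```python
import torch
import torch.nn as nn

class DynamicRelationInference(nn.Module):
    """One message-passing step on the spatio-temporal graph: each person
    node keeps only its k most relevant nodes, scored by an embedded dot
    product, and aggregates their messages with a residual update."""
    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.theta = nn.Linear(dim, dim)   # query embedding
        self.phi = nn.Linear(dim, dim)     # key embedding
        self.g = nn.Linear(dim, dim)       # message embedding
        self.k = k

    def forward(self, x):
        # x: [B, T*N, C] person features over T frames and N people
        q, key, v = self.theta(x), self.phi(x), self.g(x)
        affinity = q @ key.transpose(1, 2) / x.size(-1) ** 0.5  # [B, TN, TN]
        topv, topi = affinity.topk(self.k, dim=-1)    # k most relevant nodes
        weights = topv.softmax(dim=-1)                # interaction strengths
        neigh = torch.gather(                         # messages of chosen nodes
            v.unsqueeze(1).expand(-1, x.size(1), -1, -1),
            2, topi.unsqueeze(-1).expand(-1, -1, -1, v.size(-1)))
        return x + (weights.unsqueeze(-1) * neigh).sum(dim=2)  # residual update

# Illustrative usage: 10 frames x 12 players; stacking several such layers
# yields the iterative feature updates used before group-level classification.
feats = torch.randn(2, 10 * 12, 256)
updated = DynamicRelationInference(256, k=8)(feats)
```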