| With the widespread adoption of the internet and mobile devices, video data has accumulated at an unprecedented scale, and the automatic analysis and processing of video has become an active research area. Within this area, video object grounding has attracted considerable attention from both academia and industry because it uses natural language descriptions to identify and process specific content in videos, which aligns more closely with how humans think. This thesis focuses on a branch of video object grounding called semantic role object grounding, which aims to return, from a video, the sequence of locations of the specific visual objects that match the semantic role descriptions in the query sentence. The research challenges of this task include: (1) learning effective representations of semantic role features, (2) distinguishing different object instances within the same category, and (3) achieving cross-modal semantic alignment. Traditional video object grounding only requires locating the visual objects in a video that match the query sentence and does not differentiate between object instances that look similar and share the same category label, so it is not suitable for the semantic role object grounding task. Although existing algorithms for video semantic role grounding consider the visual relationships between objects and the semantic relationships between roles and objects, they overlook the impact of noisy proposal objects on visual features and the influence of proposal objects with the same category label on cross-modal semantic alignment, which leads to poor performance on semantic role object grounding. To further improve the performance of semantic role grounding methods, this thesis proposes a novel model architecture for video object grounding based on a hybrid attention mechanism. Specifically, (1) to learn effective semantic role feature representations, a
semantic role understanding module is designed to obtain better representations and more comprehensive features of semantic role phrases. (2) To suppress irrelevant information in the visual representations, a semantic-aware proposal object refinement module is designed to mitigate the influence of noisy proposals on visual feature learning and to further explore the subtle semantic differences between semantic roles and visual objects. (3) To improve the accuracy of cross-modal semantic alignment and reduce interference from proposal objects that are irrelevant to the semantic roles in the query text, a proposal object contrastive learning loss function is proposed. This loss function narrows the distance between the target object and the semantic role while enlarging the distance between the other objects and the semantic role, thereby optimizing the matching between the target object and the semantic role in a heterogeneous space and ultimately improving the accuracy of video object grounding. To validate the contributions of the proposed hybrid attention-based object grounding method to the semantic role grounding task, extensive experiments are conducted on four public datasets: ASRL-SPAT, ASRL-TEMP, ASRL-SVSQ, and ASRL-SEP. The experiments include: (1) comparisons with state-of-the-art baseline models to demonstrate the superiority of the proposed model, (2) ablation experiments on the model's modules to demonstrate the effectiveness of each module, and (3) visualizations of the modules' attention weights to support the underlying hypotheses. The analysis of these experimental results is used to demonstrate the contribution of the proposed method to the semantic role grounding task. |