As a bridge between vision and language, relationship detection offers a more comprehensive understanding of visual content. Relationship detection requires generalising over and integrating the visual features, semantic information and motion information of different objects, but traditional neural networks fall short here: they often require fixed-length inputs and cannot handle variable-length sequences, and they struggle to model long-term dependencies in video. Information in such networks is transferred through fixed connections that cannot account for the complex relationships between nodes, and vanishing or exploding gradients arise easily on long sequential data. Moreover, video data has a temporal dimension, and how to exploit time-domain information for accurate and efficient video relationship detection has become an active research topic in recent years. Most previous work applies convolutional or recurrent neural networks to video visual relationship detection; these methods capture the spatio-temporal information of long sequences poorly and are also inefficient. Motivated by these problems, the main research work in this paper can be divided into the following three points:

· A Transformer-based video relationship detection algorithm is proposed. The model, named VrdTran, consists of two core modules, a spatial encoder and a temporal decoder, which together capture the temporal dependencies of an object across multiple frames and reason about dynamic relationships (a minimal sketch of this two-module layout is given below). Experiments demonstrate the effectiveness of VrdTran in exploiting time-domain information for video relationship detection.

· A graph attention-based video relationship detection algorithm is proposed. The model, named ST-aGCN, combines a relation graph generation network with an attentional graph convolution: the former assembles relation triples into relation graphs carrying spatial information, and the latter fuses the spatio-temporal contextual information of multi-frame relation graphs. Experiments confirm that the graph structure aids relationship reasoning, and the proposed ST-aGCN model outperforms benchmark video relationship detection algorithms on the video relationship detection task.

· A video relationship detection algorithm based on the graph Transformer is proposed. Combining the strength of graph structure for local relational inference with the Transformer's ability to capture global video information, the graph Transformer network VGTran is designed, comprising spatial encoder, relation graph generation, position encoding and graph temporal decoder modules. The experiments also demonstrate the feasibility of Transformer networks for processing graph-structured data and verify that VGTran outperforms the other benchmark video relationship detection algorithms.

Models based on the attention mechanism handle long-sequence data such as video better. The attention-based models in this paper adaptively select the image regions or features of interest and thus perform the task more accurately. The Transformer and the attention-based graph convolution dynamically compute a weight distribution over the input data, and this weight distribution can be used to process variable-length sequence data.
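To make the spatial-encoder/temporal-decoder layout of the first contribution concrete, the following is a minimal PyTorch sketch of that idea. It is illustrative only and not the VrdTran implementation: the module sizes, the mean-pooling of per-frame object features, and the relation-class count are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class SpatialTemporalVRD(nn.Module):
    """Illustrative spatial-encoder / temporal-decoder layout (hypothetical,
    not the VrdTran implementation): the encoder attends over object features
    within each frame, the decoder attends over the resulting per-frame
    summaries across time."""
    def __init__(self, d_model=256, n_heads=8, n_layers=2, n_relations=132):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.temporal_decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.classifier = nn.Linear(d_model, n_relations)

    def forward(self, obj_feats):
        # obj_feats: (frames, objects, d_model) object features for one clip.
        T, N, D = obj_feats.shape
        # Spatial encoder: self-attention among objects inside each frame.
        per_frame = self.spatial_encoder(obj_feats)           # (T, N, D)
        # One pooled token per frame summarises its objects.
        frame_tokens = per_frame.mean(dim=1).unsqueeze(0)     # (1, T, D)
        # Temporal decoder: each frame token attends over the whole sequence,
        # capturing long-range dependencies across frames.
        fused = self.temporal_decoder(frame_tokens, frame_tokens)  # (1, T, D)
        return self.classifier(fused.squeeze(0))              # (T, n_relations)

clip = torch.randn(8, 5, 256)            # 8 frames, 5 detected objects each
print(SpatialTemporalVRD()(clip).shape)  # torch.Size([8, 132])
```

Because the attention weights are computed per input rather than fixed in advance, the same layers accept clips with any number of frames or objects, which is exactly the variable-length-sequence property argued for above.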
In addition, the graph structure can be used to compute the weights between relation nodes efficiently and thus to account for the relationships between nodes more faithfully, as the sketch after this paragraph illustrates. Experiments demonstrate that combining the advantages of graph structure with attention mechanisms enables more efficient inference for video relationship detection.
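The following GAT-style layer illustrates that point: attention weights are computed only along the edges of the relation graph, so each relation node aggregates context solely from its graph neighbours. This is a hypothetical sketch, not the ST-aGCN implementation; the adjacency mask and feature sizes are assumed for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGraphAttention(nn.Module):
    """Single attention-based graph convolution over a relation graph
    (a generic GAT-style sketch, not the ST-aGCN implementation)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        # x: (nodes, in_dim) relation-node features; adj: (nodes, nodes) 0/1 mask.
        h = self.proj(x)                                   # (N, out_dim)
        N = h.size(0)
        # Pairwise scores e_ij from concatenated node pairs.
        pairs = torch.cat([h.unsqueeze(1).expand(N, N, -1),
                           h.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.attn(pairs).squeeze(-1))     # (N, N)
        # Mask non-edges so weights flow only along the graph structure.
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)                   # per-node weight distribution
        return alpha @ h                                   # weighted neighbour aggregation

x = torch.randn(6, 64)                  # 6 relation triples as graph nodes
adj = (torch.rand(6, 6) > 0.5).float()  # illustrative random relation graph
adj.fill_diagonal_(1.0)                 # self-loops keep the softmax well-defined
print(RelationGraphAttention(64, 64)(x, adj).shape)  # torch.Size([6, 64])
```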