With the rapid development of science and technology, online video has become one of the most important channels through which people acquire knowledge and entertainment. Compared with traditional media, video resources carry far more complex semantic information. In long videos, background content occupies most of the footage, so accurately identifying valuable clips and extracting their features is the key to video understanding. Traditional video understanding techniques primarily encode the semantic information of video features in chronological order. However, this approach ignores the non-sequential semantic relationships between objects in a video, and a series of studies based on Graph Convolutional Network (GCN) technology have therefore emerged. Graph-based learning breaks away from purely sequential modeling, enriches the relationships between objects, and improves the representational capacity of models. However, existing state-of-the-art models use GCNs only to increase feature expressiveness and lack a fine-grained exploration of the technique itself. To address this problem, this paper conducts targeted research on two tasks: Temporal Action Localization (TAL) and Video Captioning (VCS).

For the temporal action localization task, existing models encode the feature representation of a video as a single whole. Although these approaches have proven effective, in practice such coarse-grained processing leaves the model with very blurred event boundaries. This paper therefore proposes a decoupled graph-based temporal action localization model that slices the video's feature representation into multiple semantic blocks, each of which learns one kind of semantic information in the video. To reduce information redundancy among the semantic blocks, multiple sets of constraint functions and an automatic routing module are designed within the decoupled graph to supervise the learning of each group of blocks. In addition, a node feature fusion module that considers global semantic information computes an importance score for each group of semantic blocks from the graph-level semantics of that group, and fuses the node representations according to these scores. Finally, the proposed model is validated on the public ActivityNet-1.3 and THUMOS-14 datasets, where the main results and module ablation experiments demonstrate its excellent performance.

For the video captioning task, we propose a multimodal captioning model based on a cause-aware inference network to address the inaccuracy of the descriptions generated by traditional video captioning models. The network decouples object features from scene information, reducing the interference of specific scenes with object representations. It then constructs object-to-object interaction graphs along both the temporal and spatial dimensions and applies graph convolution to reason about the causal relationships between objects, so that more accurate words can be chosen to describe the motion states between objects. Finally, extensive comparison experiments on the publicly available MSVD and MSR-VTT datasets show that the proposed model achieves excellent performance on the BLEU-4, METEOR, CIDEr, and ROUGE-L metrics.
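Although the abstract does not give implementation details, the decoupled-graph idea can be pictured with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than the authors' implementation: the linear per-block graph convolution, the mean-pooled graph readout, the softmax importance scores, and the cosine-similarity redundancy penalty are hypothetical choices that merely instantiate the mechanism the abstract describes (semantic blocks, a redundancy constraint, and importance-weighted node fusion).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledGraphFusion(nn.Module):
    """Sketch: K decoupled semantic blocks, each with its own graph
    convolution over the video's snippet graph, fused by importance
    scores computed from each block's graph-level semantics.
    Hypothetical design, not the paper's actual implementation."""

    def __init__(self, in_dim: int, block_dim: int, num_blocks: int):
        super().__init__()
        # One graph-convolution weight per semantic block (assumption).
        self.block_convs = nn.ModuleList(
            [nn.Linear(in_dim, block_dim) for _ in range(num_blocks)]
        )
        # Scores the mean-pooled (graph-level) semantics of each block.
        self.score = nn.Linear(block_dim, 1)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (N, in_dim)  snippet/node features of one video
        # adj: (N, N)       normalized adjacency of the snippet graph
        h = torch.stack(
            [F.relu(adj @ conv(x)) for conv in self.block_convs]
        )                                        # (K, N, block_dim)
        g = h.mean(dim=1)                        # (K, block_dim) block semantics
        w = torch.softmax(self.score(g), dim=0)  # (K, 1) block importance
        # Fuse node features as an importance-weighted sum over blocks.
        return (w.unsqueeze(1) * h).sum(dim=0)   # (N, block_dim)


def redundancy_loss(g: torch.Tensor) -> torch.Tensor:
    """Example constraint (assumption): penalize pairwise cosine
    similarity between pooled block semantics so the blocks learn
    distinct kinds of semantic information."""
    gn = F.normalize(g, dim=-1)
    off_diag = gn @ gn.t() - torch.eye(g.size(0), device=g.device)
    return off_diag.pow(2).mean()
```

A real system would stack several such layers and feed the fused node features to boundary classification and regression heads; the redundancy loss simply shows one plausible form for the constraint functions the abstract mentions without specifying.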
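The causal reasoning step of the captioning model can likewise be pictured as graph convolution over object-to-object interaction graphs built along the spatial and temporal dimensions. The sketch below is again assumption-laden: the similarity-based soft adjacency, the two-branch design, and the additive fusion are hypothetical, chosen only to illustrate the mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalObjectGraph(nn.Module):
    """Sketch of object-interaction reasoning: one GCN branch over the
    objects within each frame (spatial) and one over each object across
    frames (temporal). Hypothetical, not the paper's implementation."""

    def __init__(self, dim: int):
        super().__init__()
        self.spatial = nn.Linear(dim, dim)
        self.temporal = nn.Linear(dim, dim)

    @staticmethod
    def _soft_adj(x: torch.Tensor) -> torch.Tensor:
        # Assumed adjacency: softmax-normalized feature similarity.
        return torch.softmax(
            x @ x.transpose(-1, -2) / x.size(-1) ** 0.5, dim=-1
        )

    def forward(self, obj: torch.Tensor) -> torch.Tensor:
        # obj: (T, M, dim) = M object features per frame over T frames,
        #      assumed already decoupled from scene information.
        a_s = self._soft_adj(obj)                # (T, M, M) per-frame graph
        h_s = F.relu(a_s @ self.spatial(obj))    # spatial interactions
        t = obj.transpose(0, 1)                  # (M, T, dim) object tracks
        a_t = self._soft_adj(t)                  # (M, T, T) cross-frame graph
        h_t = F.relu(a_t @ self.temporal(t)).transpose(0, 1)
        # Fuse both relational views; a caption decoder would attend to
        # the result when choosing motion and interaction words.
        return h_s + h_t                         # (T, M, dim)
```

For example, `SpatioTemporalObjectGraph(512)(torch.randn(8, 5, 512))` reasons over five detected objects across eight frames and returns relation-enriched object features of the same shape for a downstream caption decoder.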