| Video retrieval and video description have been research hotspots in deep learning in recent years. Video retrieval embeds the feature vectors of video and text into a common space and uses the computed similarity to select matching videos from a massive video library; video description uses an encoder to obtain video feature vectors and a decoder to automatically generate a text description. Both tasks inevitably face the problems of multimodal data representation and cross-modal data interaction, so high-performance network models are urgently needed. In recent years, the Transformer model has attracted growing attention in video retrieval and description because of its strong representation capability. However, the model still has shortcomings when applied to these tasks. This thesis studies how to optimize Transformer-based video retrieval and description techniques, improving task performance by enhancing the model’s representation capability. The research content and main innovations are as follows: For the video retrieval task, a cross-modal video retrieval algorithm based on Trans DCS (Transformer with Dynamic Convolution and Shortcut) is proposed, focusing on how to optimize the Transformer structure to enhance the multimodal representation of video data. To address the lack of inductive bias in existing Transformer-based video retrieval models, the algorithm designs a multi-head self-attention module that incorporates dynamic convolution to better capture spatio-temporal information when modeling video features. Meanwhile, to prevent the learning bias from growing as the number of layers increases, the algorithm adds enhanced shortcut structures to the multi-head attention module and the feed-forward fully connected layer to capture richer feature combinations and strengthen feature transfer and fusion. Experimental results show that 
on the LSMDC dataset, the proposed algorithm improves video retrieval performance over the reference algorithm, with gains of about 2.3% and 6.1% on the MdR and MnR metrics and an average improvement of 1.8% in recall. For the video description task, the thesis focuses on video description models based on dynamic memory networks and finds that such models process and use memory-state information inadequately, which leads to repetitive and inaccurate feature representations. To address this problem, a cross-modal video description algorithm based on VM-MCA (Video Memory with Multi-head Attention and Channel Attention) is proposed. On the one hand, the algorithm improves the network's update judgment module by incorporating the multi-head attention mechanism, embedding the memory state into multiple subspaces to enrich semantic expression. On the other hand, to address the mixed-information bottleneck that arises when the multi-head attention mechanism aggregates features, the algorithm designs a channel attention mechanism that aggregates information from the different subspaces and reduces interference among them. Experimental results show that the proposed algorithm improves video description performance on the ActivityNet Captions dataset over the reference algorithm: the accuracy metric CIDEr improves by 1.08%, and the discrimination metric Div@1 improves by 2.23%. |
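
The retrieval setup and the reported metrics can be illustrated with a small sketch. It is not the thesis's implementation: the toy two-dimensional embeddings, the function names, and the assumption that query i matches video i are all hypothetical, and cosine similarity is used here only as one common choice of similarity in a shared embedding space. The sketch ranks videos for each text query and computes recall at K (R@K), median rank (MdR), and mean rank (MnR); for MdR and MnR, lower values are better.

```python
import math
from statistics import mean, median


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_of_match(query, videos, match_index):
    """1-based rank of the ground-truth video when videos are sorted by similarity."""
    sims = [cosine(query, v) for v in videos]
    target = sims[match_index]
    # Rank is one plus the number of videos that score strictly higher.
    return 1 + sum(1 for s in sims if s > target)


def retrieval_metrics(text_embs, video_embs, ks=(1, 5, 10)):
    """Compute R@K (in percent), MdR, and MnR.

    Assumption for this sketch: text_embs[i] describes video_embs[i].
    """
    ranks = [rank_of_match(t, video_embs, i) for i, t in enumerate(text_embs)]
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["MdR"] = median(ranks)  # median rank, lower is better
    metrics["MnR"] = mean(ranks)    # mean rank, lower is better
    return metrics, ranks


# Hypothetical embeddings already projected into a common space.
video_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]

metrics, ranks = retrieval_metrics(text_embs, video_embs)
print(ranks)
print(metrics)
```

In this toy example the third query ranks a non-matching video first, so its ground-truth video lands at rank 2; that lowers R@1 and raises MnR while leaving MdR unchanged, which shows why the three metrics are usually reported together.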