With the rapid development of computer technology, generating multi-sentence descriptions for videos has become one of the most challenging computer vision tasks, as it requires not only visual relevance but also discourse coherence between the sentences of a paragraph. Dense video description aims to generate descriptions for all possible events in an untrimmed video. The task is challenging because it requires accurately localizing the events in the video and simultaneously describing each event in natural-language sentences. Most current models adopt a two-stage approach, which lowers efficiency, and because the two stages are independent of each other, the model cannot fully exploit contextual information. For feature extraction, most previous work on dense video description relies on visual information alone, ignoring audio and semantic information, which leads to incomplete descriptions. To address these problems, this paper proposes a multimodal-fusion dense video description method based on the Transformer network. First, an end-to-end dense video description model is designed: it applies parallel prediction heads and adapts the object detection algorithm Detection Transformer (DETR) to dense video description, realizing end-to-end description of the video and generating mutually correlated event descriptions. For the encoder-decoder framework, a multi-scale deformable temporal attention module is proposed, which strengthens the model's use of local information and speeds up convergence while improving accuracy. On top of the end-to-end model, multimodal feature fusion is added: an adaptive R(2+1)D network is proposed to extract visual features, and audio features are added as a complement to generate richer and more accurate descriptions. A semantic detector is designed to generate semantic information, alleviating the semantic inconsistency of video description results. Comparative and ablation experiments were conducted on the large benchmark datasets ActivityNet Captions and YouCook2, and better results were obtained on all four evaluation metrics. On the ActivityNet Captions dataset, the BLEU_4 score reached 2.17, exceeding the best previously reported result, while METEOR, CIDEr, and SODA reached 9.23, 42.14, and 6.05, respectively. On the YouCook2 dataset, the model performed well on all four metrics, with BLEU_4 reaching 0.92. The experimental results demonstrate that the model performs well on both benchmark datasets: the end-to-end approach with multimodal feature inputs leads to a significant improvement in performance.
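To make the multi-scale deformable temporal attention module more concrete, the following is a minimal sketch of how such a layer could be implemented, assuming a Deformable-DETR-style design transferred to the 1D temporal axis of video features. The class name, tensor shapes, and hyperparameters are illustrative assumptions, not the authors' implementation: each event query predicts a small set of sampling offsets and weights per temporal scale, so attention stays local and sparse instead of attending to every frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableTemporalAttention(nn.Module):
    """Hypothetical single-head multi-scale deformable attention over time."""

    def __init__(self, d_model=256, n_levels=4, n_points=4):
        super().__init__()
        self.n_levels, self.n_points = n_levels, n_points
        # Each query predicts where to sample on every temporal scale
        # and how much to weight each sampled feature.
        self.sampling_offsets = nn.Linear(d_model, n_levels * n_points)
        self.attention_weights = nn.Linear(d_model, n_levels * n_points)
        self.value_proj = nn.Linear(d_model, d_model)
        self.output_proj = nn.Linear(d_model, d_model)

    def forward(self, query, ref_t, feats):
        """
        query: (B, Q, C)  event queries
        ref_t: (B, Q)     reference times, normalized to [0, 1]
        feats: list of n_levels tensors, each (B, T_l, C)
        """
        B, Q, _ = query.shape
        offsets = self.sampling_offsets(query).view(B, Q, self.n_levels, self.n_points)
        weights = self.attention_weights(query).view(B, Q, -1)
        weights = weights.softmax(-1).view(B, Q, self.n_levels, self.n_points)

        sampled = []
        for lvl, f in enumerate(feats):
            T_l = f.shape[1]
            v = self.value_proj(f).transpose(1, 2).unsqueeze(2)   # (B, C, 1, T_l)
            # Sampling location = reference time + learned, scale-relative offset.
            loc = ref_t.unsqueeze(-1) + offsets[:, :, lvl] / T_l  # (B, Q, P)
            grid_x = loc * 2 - 1                                  # grid_sample expects [-1, 1]
            grid = torch.stack([grid_x, torch.zeros_like(grid_x)], dim=-1)  # (B, Q, P, 2)
            sampled.append(F.grid_sample(v, grid, align_corners=False))     # (B, C, Q, P)

        samp = torch.stack(sampled, dim=3)                        # (B, C, Q, L, P)
        out = (samp * weights.unsqueeze(1)).sum(dim=(3, 4))       # (B, C, Q)
        return self.output_proj(out.transpose(1, 2))              # (B, Q, C)


# Usage with made-up shapes: 2 videos, 10 event queries, 4 temporal scales.
att = DeformableTemporalAttention()
queries = torch.randn(2, 10, 256)
ref = torch.rand(2, 10)
feats = [torch.randn(2, t, 256) for t in (128, 64, 32, 16)]
print(att(queries, ref, feats).shape)  # torch.Size([2, 10, 256])
```

Because each query gathers only n_levels × n_points features around its reference time, this kind of layer focuses on local temporal context and typically converges faster than dense attention, which is consistent with the benefits claimed for the proposed module.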