| With the rapid development of computer networking and information technology, social media has become an indispensable part of people's lives. Owing to the massive amount of data circulating on the Internet, humanity has entered the era of big data, and images and videos have become the most popular forms of data because of their ability to record and enrich our daily lives. In the field of artificial intelligence, research on how to make computers think in a more human-like way and understand and describe visual information in natural language has attracted increasing attention. Visual captioning can be applied to multimedia information analysis, human-computer interaction, and assistance for the visually impaired, and many researchers have studied it extensively. In recent years, the encoder-decoder framework has been widely used for visual captioning tasks. Because it effectively reveals the relationship between visual information and the words of a caption, the temporal-attention mechanism has become the dominant approach in current research. However, accurately obtaining dynamic visual features and semantic information from video remains a difficult problem. In this paper, we propose two methods for video captioning: 1) a fine-grained spatial-temporal attention-based model, in which fine-grained visual information is extracted from videos to obtain accurate region-level visual features, serving as a form of hard spatial attention, while a temporal-attention-based LSTM network ensures that the words of the generated sentence correspond accurately to the relevant visual features; and 2) a dual-stream attention model based on visual and semantic features, in which visual and semantic features are extracted from videos simultaneously, and a temporal-attention-based LSTM network realizes dynamic attention over the multimodal information to improve the accuracy of the generated captions. Experiments are conducted on two popular datasets, MSVD and MSR-VTT, and the effectiveness of the proposed methods is verified by comparison with other methods. |
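To make the attention mechanisms described above concrete, the following is a minimal PyTorch-style sketch, not the paper's actual implementation: a temporal-attention module that scores per-frame features against the decoder's hidden state at each step, and a dual-stream decoding step that attends to visual and semantic feature streams in parallel. All class names, parameter names, and dimensions (`TemporalAttention`, `DualStreamDecoderStep`, `attn_dim`, and so on) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Additive (Bahdanau-style) temporal attention over per-frame features.

    At each decoding step, the decoder hidden state scores every frame
    feature; the softmax-normalized scores weight the frames into a single
    context vector, so each generated word can focus on the most relevant
    frames.
    """
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats: torch.Tensor, hidden: torch.Tensor):
        # feats:  (batch, n_frames, feat_dim) per-frame features
        # hidden: (batch, hidden_dim)         decoder LSTM hidden state
        energy = torch.tanh(self.feat_proj(feats)
                            + self.hidden_proj(hidden).unsqueeze(1))
        alpha = F.softmax(self.score(energy).squeeze(-1), dim=1)   # (batch, n_frames)
        context = torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)  # (batch, feat_dim)
        return context, alpha

class DualStreamDecoderStep(nn.Module):
    """One decoding step that attends to visual and semantic streams in parallel."""
    def __init__(self, vis_dim, sem_dim, embed_dim, hidden_dim, attn_dim, vocab_size):
        super().__init__()
        self.vis_attn = TemporalAttention(vis_dim, hidden_dim, attn_dim)
        self.sem_attn = TemporalAttention(sem_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + vis_dim + sem_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_emb, vis_feats, sem_feats, state):
        h, c = state
        # Each stream is attended independently with the same hidden state.
        vis_ctx, _ = self.vis_attn(vis_feats, h)
        sem_ctx, _ = self.sem_attn(sem_feats, h)
        # Fuse the previous word embedding with both contexts, then update.
        h, c = self.lstm(torch.cat([word_emb, vis_ctx, sem_ctx], dim=-1), (h, c))
        return self.out(h), (h, c)  # vocabulary logits and new LSTM state
```

In this sketch the two streams are fused by simple concatenation before the LSTM update; the paper's dual-stream model may weight or combine the modalities differently.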