| With the rapid development of computer networking and information technology, social media has become an indispensable part of people's lives. Owing to the massive amount of data circulating on the Internet, humanity has entered the era of big data, and images and videos have become the most popular forms of data because of their ability to record and enrich our daily lives. In the field of artificial intelligence, research on how to make computers think in a more human-like way and understand and describe visual information in natural language has attracted increasing attention. Visual captioning can be applied to multimedia information analysis, human-computer interaction, and assistance for the visually impaired, and many researchers have studied it extensively. In recent years, the encoder-decoder framework has been widely used for visual captioning tasks. Because it effectively reveals the relationship between visual information and the words of a caption, the temporal-attention mechanism has become the dominant approach in current research. However, accurately obtaining dynamic visual features and semantic information from video remains a difficult problem. In this paper, we propose two methods for video captioning: 1) a fine-grained spatial-temporal attention-based model, in which fine-grained visual information is extracted from videos to obtain accurate region-level visual features, serving as a form of hard spatial attention, while a temporal-attention-based LSTM network ensures that the words of the generated sentence correspond accurately to the relevant visual features; and 2) a dual-stream attention model based on visual and semantic features, in which visual and semantic features are extracted from videos simultaneously, and a temporal-attention-based LSTM network realizes dynamic attention over the multimodal information to improve the accuracy of the generated captions. Experiments are conducted on two popular datasets, MSVD and MSR-VTT, and the effectiveness of the proposed methods is verified by comparison with other methods. |
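To make the attention mechanisms described above concrete, the following is a minimal PyTorch-style sketch, not the paper's actual implementation: a temporal-attention module that scores per-frame features against the decoder's hidden state at each step, and a dual-stream decoding step that attends to visual and semantic feature streams in parallel. All class names, parameter names, and dimensions (`TemporalAttention`, `DualStreamDecoderStep`, `attn_dim`, and so on) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Additive (Bahdanau-style) temporal attention over per-frame features.

    At each decoding step, the decoder hidden state scores every frame
    feature; the softmax-normalized scores weight the frames into a single
    context vector, so each generated word can focus on the most relevant
    frames.
    """
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats: torch.Tensor, hidden: torch.Tensor):
        # feats:  (batch, n_frames, feat_dim) per-frame features
        # hidden: (batch, hidden_dim)         decoder LSTM hidden state
        energy = torch.tanh(self.feat_proj(feats)
                            + self.hidden_proj(hidden).unsqueeze(1))
        alpha = F.softmax(self.score(energy).squeeze(-1), dim=1)   # (batch, n_frames)
        context = torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)  # (batch, feat_dim)
        return context, alpha

class DualStreamDecoderStep(nn.Module):
    """One decoding step that attends to visual and semantic streams in parallel."""
    def __init__(self, vis_dim, sem_dim, embed_dim, hidden_dim, attn_dim, vocab_size):
        super().__init__()
        self.vis_attn = TemporalAttention(vis_dim, hidden_dim, attn_dim)
        self.sem_attn = TemporalAttention(sem_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + vis_dim + sem_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_emb, vis_feats, sem_feats, state):
        h, c = state
        # Each stream is attended independently with the same hidden state.
        vis_ctx, _ = self.vis_attn(vis_feats, h)
        sem_ctx, _ = self.sem_attn(sem_feats, h)
        # Fuse the previous word embedding with both contexts, then update.
        h, c = self.lstm(torch.cat([word_emb, vis_ctx, sem_ctx], dim=-1), (h, c))
        return self.out(h), (h, c)  # vocabulary logits and new LSTM state
```

In this sketch the two streams are fused by simple concatenation before the LSTM update; the paper's dual-stream model may weight or combine the modalities differently.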