
Spatio-temporal Attention Model For Video Captioning

Posted on: 2020-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Wang
Full Text: PDF
GTID: 2518306515484834
Subject: Software engineering

Abstract/Summary:
With the development of computer technology, video has become ubiquitous on the Internet, broadcast channels, and personal devices. This has encouraged the development of advanced techniques for analyzing video semantics in a wide variety of applications, such as video retrieval, automatic video subtitling, and navigation aids for the blind. Video understanding has been a fundamental challenge of computer vision for decades. Previous research predominantly focused on describing videos with a predefined yet limited set of individual words. Thanks to recent advances in Recurrent Neural Networks (RNN), researchers have strived to automatically describe video content with complete, natural sentences, which can be regarded as the ultimate goal of video understanding.

Video captioning refers to generating sentences that correspond to the content of a video, i.e., its visual features. Given the massive volume of video data, describing every video manually would be an enormous waste of manpower and financial resources, so automatic video captioning is an inevitable trend, and it has made significant advances in recent years with the development of deep learning. Most recently proposed methods are based on the Encoder-Decoder framework: the encoder first uses a Convolutional Neural Network (CNN) to extract a representation of each static frame and then stacks the frame representations with an RNN to form the video representation, while the decoder uses another RNN to generate the natural language description. However, because of the actors, objects, and complex interactions among them in a video, video captioning remains a challenging task, so spotting the key areas that represent the visual content and encoding them into rich descriptors is particularly important.

In this thesis, we propose two methods for video captioning. The first encodes the sequential frames into a spatio-temporal representation at each time step to utter a word, and further distills the most relevant visual content with an extra semantic loss. The second automatically spots salient regions in each video frame and simultaneously learns a discriminative spatio-temporal representation for video captioning.
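To make the Encoder-Decoder-with-attention framework described above concrete, the following is a minimal sketch of a single decoder step with additive (Bahdanau-style) attention over CNN frame features. It assumes per-frame features have already been extracted by a CNN (e.g. 2048-d vectors); all module names and dimensions here are illustrative and do not reflect the thesis's actual implementation.

```python
# Minimal sketch of one temporal-attention decoder step for video captioning.
# Assumes frame_feats are precomputed CNN features; sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.attn_feat = nn.Linear(feat_dim, hidden_dim)      # project frame features
        self.attn_hidden = nn.Linear(hidden_dim, hidden_dim)  # project decoder state
        self.attn_score = nn.Linear(hidden_dim, 1)            # scalar score per frame
        self.rnn = nn.GRUCell(feat_dim + hidden_dim, hidden_dim)
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, prev_word, hidden):
        # frame_feats: (batch, n_frames, feat_dim)
        # prev_word:   (batch,) word indices; hidden: (batch, hidden_dim)
        scores = self.attn_score(torch.tanh(
            self.attn_feat(frame_feats)
            + self.attn_hidden(hidden).unsqueeze(1)))     # (batch, n_frames, 1)
        alpha = F.softmax(scores, dim=1)                  # attention over time
        context = (alpha * frame_feats).sum(dim=1)        # attended video context
        emb = self.word_emb(prev_word)
        hidden = self.rnn(torch.cat([context, emb], dim=1), hidden)
        return self.out(hidden), hidden, alpha.squeeze(-1)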
Keywords/Search Tags:Video Captioning, Deep Learning, Attention Mechanism, Video Encoding, Spatio-temporal Representation