
Spatio-temporal Attention Model For Video Captioning

Posted on: 2020-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Wang
Full Text: PDF
GTID: 2518306515484834
Subject: Software engineering

Abstract/Summary:
With the development of computer technology, video has become ubiquitous on the Internet, broadcast channels, and personal devices. This has encouraged the development of advanced techniques for analyzing video semantics in a wide variety of applications, such as video retrieval, automatic video subtitling, and navigation aids for the blind. Video understanding has been a fundamental challenge of computer vision for decades. Previous research predominantly focused on describing videos with a predefined yet limited set of individual words. Thanks to recent advances in Recurrent Neural Networks (RNN), researchers have strived to automatically describe video content with complete, natural sentences, which can be regarded as the ultimate goal of video understanding.

Video captioning refers to generating sentences that correspond to the content of a video, i.e., its visual features. Given the massive volume of video data, describing every video manually would be an enormous waste of manpower and financial resources, so automatic video captioning is an inevitable trend, and it has made significant advances in recent years with the development of deep learning. Most recently proposed methods are based on the Encoder-Decoder framework: the encoder first uses a Convolutional Neural Network (CNN) to extract a representation of each static frame and then stacks the frame representations with an RNN to form the video representation, while the decoder uses another RNN to generate the natural language description. However, because of the actors, objects, and complex interactions among them in a video, video captioning remains a challenging task, so spotting the key areas that represent the visual content and encoding them into rich descriptors is particularly important.

In this thesis, we propose two methods for video captioning. The first encodes the sequential frames into a spatio-temporal representation at each time step to utter a word, and further distills the most relevant visual content with an extra semantic loss. The second automatically spots salient regions in each video frame and simultaneously learns a discriminative spatio-temporal representation for video captioning.
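To make the Encoder-Decoder-with-attention framework described above concrete, the following is a minimal sketch of a single decoder step with additive (Bahdanau-style) attention over CNN frame features. It assumes per-frame features have already been extracted by a CNN (e.g. 2048-d vectors); all module names and dimensions here are illustrative and do not reflect the thesis's actual implementation.

```python
# Minimal sketch of one temporal-attention decoder step for video captioning.
# Assumes frame_feats are precomputed CNN features; sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.attn_feat = nn.Linear(feat_dim, hidden_dim)      # project frame features
        self.attn_hidden = nn.Linear(hidden_dim, hidden_dim)  # project decoder state
        self.attn_score = nn.Linear(hidden_dim, 1)            # scalar score per frame
        self.rnn = nn.GRUCell(feat_dim + hidden_dim, hidden_dim)
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, prev_word, hidden):
        # frame_feats: (batch, n_frames, feat_dim)
        # prev_word:   (batch,) word indices; hidden: (batch, hidden_dim)
        scores = self.attn_score(torch.tanh(
            self.attn_feat(frame_feats)
            + self.attn_hidden(hidden).unsqueeze(1)))     # (batch, n_frames, 1)
        alpha = F.softmax(scores, dim=1)                  # attention over time
        context = (alpha * frame_feats).sum(dim=1)        # attended video context
        emb = self.word_emb(prev_word)
        hidden = self.rnn(torch.cat([context, emb], dim=1), hidden)
        return self.out(hidden), hidden, alpha.squeeze(-1)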
Keywords/Search Tags:Video Captioning, Deep Learning, Attention Mechanism, Video Encoding, Spatio-temporal Representation