
Deep Learning-Based Fine-Grained Sports Video Captioning Research

Posted on: 2020-06-01
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Yu
Full Text: PDF
GTID: 2428330620460050
Subject: Information and Communication Engineering
Abstract/Summary:
Video captioning refers to the technique of constructing a neural network that mines the visual information and semantics of the frames in a video and outputs a description, in the form of a sentence or paragraph, that humans can readily understand. Such technology converts information from the more redundant video format into the more compact text form, which is of great significance in this era of rapid information expansion, and it has been widely applied to video retrieval, action detection, content filtering, and other tasks. In recent years, video captioning has developed rapidly, and a number of generation methods have emerged that solve technical problems such as action localization and multi-sentence generation. However, fine-grained video captioning, that is, describing the detailed actions of multiple subjects and the frequent interactions between them in a video with a long time span and rich detail, remains far from solved, even though it has great application value, for example in the automatic narration of sports videos. To this end, this thesis proposes a fine-grained video captioning method devoted to narrating sports videos rich in details and interactions. Toward this goal, this work makes the following contributions.

First, to investigate the new topic of fine-grained video captioning, this thesis collects a completely new dataset, the Fine-grained Sports Narrative (FSN) dataset. It contains 12,000 HD basketball and volleyball videos from YouTube, each manually annotated with action proposals and paragraph descriptions. The dataset embodies the core challenges this task must address, such as fine-grained actions and multi-agent interactions.

Second, this thesis proposes a new video captioning evaluation metric, the Fine-grained Captioning Evaluation (FCE), to provide a more reasonable measure for this new task. FCE improves upon the widely used METEOR metric: in addition to evaluating the semantic quality of the output, it also considers the accuracy of the detailed actions and the correctness of the order in which the actions are described, both of which are crucial aspects of fine-grained video captioning.

Finally, this thesis proposes a new deep neural network framework for the fine-grained video captioning task. The network consists of three sub-networks: 1) a spatio-temporal entity localization and role discovering sub-network, which divides the video into proposals according to action segments and localizes and identifies the players within each proposal; 2) a fine-grained action modeling sub-network, which improves the recognition accuracy of detailed actions by introducing improved skeleton-based descriptors; and 3) a group relationship modeling sub-network, which explores the interactions between players. The output features of the three sub-networks are then fused and encoded-decoded by a hierarchical RNN (h-RNN) to produce the final description paragraph.

Sufficient experiments were conducted on the FSN dataset. Scores on a number of evaluation metrics demonstrate that the proposed fine-grained video captioning model is well suited to the problem of sports video narration. In addition, comparisons with state-of-the-art video captioning methods further demonstrate the effectiveness and superiority of the proposed model.
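The abstract does not give the exact formula of FCE, so the following is only an illustrative sketch of the idea it describes: combining a METEOR-like semantic term with an action-accuracy term and an action-ordering term. The unigram F-score stands in for the full METEOR computation, the action vocabulary `ACTION_WORDS` and the weights `w_sem`, `w_act`, `w_ord` are hypothetical, and the ordering score is a simplified Kendall-tau-style count of concordant action pairs.

```python
# Illustrative sketch only: FCE's real definition is not given in the abstract.
# Assumed components: (1) unigram F-score as a METEOR-like semantic proxy,
# (2) recall of fine-grained action words, (3) a Kendall-tau-style score on the
# order in which matched actions are mentioned. Vocabulary/weights are made up.
from collections import Counter

ACTION_WORDS = {"dribbles", "passes", "shoots", "blocks", "rebounds"}  # assumed

def unigram_f1(cand_tokens, ref_tokens):
    """Harmonic mean of unigram precision and recall (METEOR-like proxy)."""
    c, r = Counter(cand_tokens), Counter(ref_tokens)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

def action_order_score(cand_tokens, ref_tokens):
    """Fraction of shared-action pairs mentioned in the same relative order."""
    cand_seq = [w for w in cand_tokens if w in ACTION_WORDS]
    ref_seq = [w for w in ref_tokens if w in ACTION_WORDS]
    # Shared actions, deduplicated, in the reference's order of mention.
    shared = [w for w in dict.fromkeys(ref_seq) if w in cand_seq]
    if len(shared) < 2:
        return 1.0 if shared else 0.0
    pos = {w: i for i, w in enumerate(cand_seq)}
    concordant = total = 0
    for i in range(len(shared)):
        for j in range(i + 1, len(shared)):
            total += 1
            if pos[shared[i]] < pos[shared[j]]:
                concordant += 1
    return concordant / total

def fce_like(candidate, reference, w_sem=0.5, w_act=0.25, w_ord=0.25):
    """Toy FCE-style score: semantics + action recall + action ordering."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    ref_actions = {w for w in ref if w in ACTION_WORDS}
    act_recall = (sum(1 for w in ref_actions if w in cand) / len(ref_actions)
                  if ref_actions else 1.0)
    return (w_sem * unigram_f1(cand, ref)
            + w_act * act_recall
            + w_ord * action_order_score(cand, ref))
```

Under this toy scoring, a caption that names the right actions in the wrong order keeps its semantic and action-recall credit but loses the ordering term, which matches the abstract's motivation that ordering errors should be penalized even when the bag of words is correct.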
Keywords/Search Tags:Video Captioning, Fine-grained, Deep Learning, Recurrent Neural Network, Sports Video Narrative