With the rapid development of information technology and the large-scale popularization of smart devices in recent years, a large number of videos have been produced in modern society. Considering the temporal relations among video frames, context inevitably plays an important role in research on video understanding. Most real-world videos last a long time and are likely to contain a variety of action segments as well as background content that users are not interested in. To understand the video content, a researcher must first recognize the semantic segments containing actions and then analyze their content. Therefore, this dissertation first studies temporal action detection, which identifies the temporal boundaries (i.e., the start and end points) and the corresponding categories of all action segments within a video. Temporal action detection can generally be divided into two stages, i.e., proposal generation and action classification, where proposal generation pre-generates temporal proposals that may contain actions to facilitate the subsequent classification. After detecting the temporal action segments, this dissertation studies video captioning, which generates descriptive sentences for the video content. These technologies have important social and practical significance: temporal action detection can recognize anomalous actions or events in surveillance videos, and video captioning can convert visual content into language. Because existing understanding techniques do not sufficiently exploit video context, this dissertation focuses on how to utilize deep learning and video context to deepen the understanding of video content. The contributions on video captioning and temporal action detection are summarized as follows.

To generate temporal action proposals, this dissertation studies different sequence learning networks for context modeling in video sequences, including the convolutional network and the self-attention network. Considering the duration limitation imposed by the pre-defined anchor boxes of previous algorithms, this dissertation proposes a multiscale temporal action proposal generation algorithm based on the convolutional network. First, the proposed algorithm utilizes convolutional operations to capture temporal contexts in the video sequence and achieves a remarkable speedup by parallelizing the computations. Second, it divides the receptive field of the convolutional network into multiple scale ranges and refines the corresponding temporal boundaries using duration regression at each scale. With this multiscale duration regression mechanism, the proposed algorithm relaxes the duration limitation of the anchors and generates temporal action proposals of arbitrary duration.

To further relax the limitations of the anchor boxes and the network modeling range, this dissertation proposes a temporal action proposal generation algorithm based on two-level self-attention networks. The algorithm consists of two modules, which respectively model the frame-level and the proposal-level relationships to complete the proposal generation task. In the frame-level relation module, the proposed algorithm divides the multiple attention heads into several groups and encodes the local contexts at different temporal locations, which effectively captures the temporal boundary information in the video sequence. In the proposal-level relation module, the proposed algorithm incorporates the relative temporal positions between proposals into the relational modeling, which enhances their representations.

After generating temporal action proposals, this dissertation proposes a temporal action detection algorithm that refines the generated proposals based on the video context. First, to refine each generated proposal, the proposed algorithm augments the proposal with two neighboring segments of equal length, utilizing the contextual information from the past and future segments to assist the detection of the target segment within the augmented area. Second, the proposed algorithm regresses not only the temporal locations of the target segments but also their IoUs with the ground truths. With this regression strategy, the proposed algorithm obtains precise estimates of the location and action probability, which improves the overall detection performance.

Considering that previous algorithms do not capture the temporal structures in videos at a sufficiently fine granularity, this dissertation proposes a video captioning algorithm based on the spatio-temporal context and the channel attention mechanism. First, by varying the convolution kernel size in the recurrent convolutional network, the proposed algorithm integrates contexts of different spatio-temporal ranges into the video feature representation. Second, the proposed algorithm incorporates a channel attention mechanism on top of the frame attention mechanism, which highlights the contribution of channel-level features when generating the descriptive words. In short, the proposed algorithm captures much finer temporal structures of videos, which improves the captioning performance.

Extensive experiments and analyses have been carried out on the above algorithms. The experimental results show that the proposed algorithms not only extract the latent information from the video content efficiently, but also outperform comparable algorithms in the performance evaluation.
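As a minimal illustrative sketch of the proposal refinement step, the context augmentation with equal-length past and future segments, together with the temporal IoU that serves as a regression target, might look as follows. The function names, and the assumption that proposals are (start, end) pairs in seconds, are ours for illustration and are not taken from the dissertation.

```python
def augment_proposal(start, end):
    """Extend a proposal with a past and a future segment of equal length,
    yielding a three-times-longer window whose context aids refinement."""
    length = end - start
    return (start - length, end + length)

def temporal_iou(seg_a, seg_b):
    """Temporal intersection-over-union between two (start, end) segments,
    used here as the IoU regression target against the ground truth."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, a proposal spanning seconds 10 to 20 is augmented to the window (0, 30), and a predicted segment (12, 22) against a ground truth (10, 20) yields an IoU of 8/12.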