
Video Analytics Based On Deep Spatio-Temporal Feature Fusion

Posted on: 2022-09-30    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X L Song    Full Text: PDF
GTID: 1528307154466604    Subject: Information and Communication Engineering
Abstract/Summary:
Video is an invaluable multimedia medium. Compared with images, videos carry richer information in both the spatial and temporal dimensions. With the popularity of multimedia applications and the development of multimedia technology, video analysis and understanding tasks play an increasingly prominent role in real-world scenarios, especially in security, transportation, medicine, and other fields. This thesis studies video motion estimation and action recognition via spatio-temporal feature fusion, focusing on the temporal dynamics of videos and proposing solutions that exploit joint spatio-temporal information. The work progresses from low-level, short-term temporal representation in video analysis to high-level, long-range video understanding with joint spatio-temporal features, and extends from supervised, task-specific video learning to unsupervised representation learning with a richer feature space, gradually improving the ability to fuse deep spatio-temporal features effectively and strengthening joint spatial-temporal video representation in analysis tasks. For motion estimation, this thesis adopts spatio-temporal contextual information as an effective complement to matching information and employs a multi-level structure to capture detail, so as to model irregular objects and large displacements. For video action recognition, this thesis addresses long-range, temporally related action recognition and the corresponding self-supervised and cross-domain transfer learning tasks. The proposed methods show significant performance on each task and provide effective solutions for the corresponding applications. The results and innovations of this thesis are summarized as follows.

1. An optical flow estimation network based on pyramidal correlation matching and contextual residual reconstruction, enabling effective multi-scale matching and reconstruction compensation. The loss function combines supervised and unsupervised terms and introduces an optical flow regularizer tailored to the characteristics of optical flow.

2. A new model for context-aware optical flow estimation. Around the proposed contextual attention framework, three contextual modules are designed for feature extraction, correlation computation, and optical flow reconstruction, respectively. A lightweight matrix multiplication method is also proposed to accelerate the computation of the contextual attention module.

3. A new framework for video classification with joint spatio-temporal information, which uses a temporal-spatial mapping operation to embed the time-space information of dense frame features and explores temporal dynamics through a convolutional neural network with a temporal attention mechanism.

4. A new scheme for video self-supervised contrastive learning and self-supervised cross-domain transfer learning, which improves the generalization of local clip-level and global video-level temporal modeling and measures video-level differences between the source and target domains via video-level contrastive cross-domain alignment.

Experimental results show that the proposed methods are effective for the corresponding tasks and are helpful for related applications.
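The abstract does not give implementation details for the pyramidal correlation matching used in optical flow estimation. As background, a minimal NumPy sketch of a single-level local correlation (cost) volume is shown below; the window size `max_disp` and the feature shapes are illustrative assumptions, not the thesis's actual configuration. A pyramid of such volumes over progressively downsampled features is the standard way such matching handles large displacements.

```python
import numpy as np

def correlation_volume(feat1, feat2, max_disp=3):
    """Local correlation (cost) volume between two feature maps.

    feat1, feat2: (H, W, C) feature maps from consecutive frames.
    Returns (H, W, (2*max_disp+1)**2): for each pixel of feat1, its
    dot product with feat2 pixels in a (2*max_disp+1)^2 search window.
    """
    H, W, C = feat1.shape
    # Zero-pad frame 2 so every shift inside the window is valid.
    pad = np.pad(feat2, ((max_disp, max_disp), (max_disp, max_disp), (0, 0)))
    vols = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = pad[dy:dy + H, dx:dx + W, :]
            vols.append((feat1 * shifted).sum(axis=2))
    # Normalize by sqrt(C), as is common for dot-product matching costs.
    return np.stack(vols, axis=-1) / np.sqrt(C)
```

At coarse pyramid levels a small window covers a large displacement in the original resolution, which is the usual rationale for the coarse-to-fine design.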
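The "lightweight matrix multiplication method" for accelerating contextual attention is not specified in the abstract. One widely used trick that fits the description, shown here purely as an illustrative sketch and not as the thesis's actual method, is to linearize attention via a positive feature map `phi` and matrix associativity, which drops the cost from quadratic to linear in the number of pixels:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def quadratic_attention(Q, K, V):
    # Standard attention: builds an N x N map, O(N^2 d) time, O(N^2) memory.
    A = softmax(Q @ K.T / np.sqrt(Q.shape[1]))
    return A @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Linearized attention: with a positive feature map phi,
    # (phi(Q) phi(K)^T) V == phi(Q) (phi(K)^T V) by associativity,
    # so the d x d product phi(K)^T V is formed first: O(N d^2).
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                              # d x d
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T   # N x 1 row normalizer
    return (Qp @ KV) / Z
```

Since N is the number of pixels in a feature map (often tens of thousands) while d is a small channel dimension, reordering the multiplications is what makes dense per-pixel attention affordable in flow estimation.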
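For the video self-supervised contrastive learning in contribution 4, the abstract again gives no formulation. A minimal sketch of the standard InfoNCE objective on clip embeddings is given below for orientation; the batch layout (positives paired row-by-row, all other rows acting as negatives) and the temperature value are assumptions of this sketch, not details from the thesis.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE contrastive loss on clip/video embeddings.

    anchors, positives: (B, D) arrays; positives[i] is another view
    (e.g. a temporally shifted or augmented clip) of the same video
    as anchors[i]; the remaining rows serve as negatives.
    """
    # L2-normalize so similarities are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matched pairs sit on the diagonal; minimize their negative log-prob.
    return -np.mean(np.diag(log_prob))
```

Applied at the clip level this encourages short-term invariances, while applying it to video-level embeddings across domains yields a contrastive alignment signal, which matches the abstract's description of video-level cross-domain alignment as a metric.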
Keywords/Search Tags: Video analytics, Spatio-temporal feature fusion, Action recognition, Motion estimation, Video self-supervision, Video domain adaptation