
Research On Deep Learning Based Video Classification Technologies

Posted on: 2019-02-04    Degree: Master    Type: Thesis
Country: China    Candidate: H X Zhi    Full Text: PDF
GTID: 2428330566470948    Subject: Information and Communication Engineering
Abstract/Summary:
With the coming of the we-media era, recording, watching, and sharing videos have become an indispensable part of daily life, and the number of videos on the Internet continues to grow geometrically. The spread of these massive video collections brings convenience on the one hand and poses stern challenges to regulation and search on the other. It is therefore imperative to develop efficient, intelligent video content analysis methods that can meet the demands of massive video processing. As an effective means of video content analysis and an important direction in computer vision, video classification offers a way to address these problems and has been drawing increasing attention.

In recent years, deep learning has been applied successfully to many visual tasks, especially image recognition, and has become the mainstream method in the field. Compared with image processing, however, research on deep learning for video processing is still at an early stage. In this context, this thesis is devoted to deep learning based video classification, focusing on feature extraction, feature fusion, and similarity measurement, on the basis of a comprehensive review of related work. We propose a deep feature extraction algorithm with better representational capacity and robustness, an algorithm for modeling the intrinsic contextual relationship between deep spatial and temporal features, and a similarity measurement algorithm that accounts for differences in the semantic distance of negative samples. The main work and contributions of this thesis are as follows:

1. Existing temporal-pyramid feature extraction methods cannot learn the temporal dependencies among frames and video segments, nor the layered structure of videos, which results in insufficient feature extraction. To address this, we propose a multi-level, multi-granularity feature extraction method based on cascaded SRUs. First, we extract low-, mid-, and high-level frame features from a base CNN; then we use cascaded SRUs to build a temporal pyramid at each feature level and learn the temporal dependencies of the video; finally, we fuse the three temporal pyramids to obtain a multi-level, multi-granularity representation of the video. Experimental results show that the features extracted by this method have better representational capacity and robustness.

2. Existing feature fusion methods cannot adequately model the intrinsic relationship between deep temporal and spatial features. We propose a cascaded two-level encoding fusion algorithm. First, Fisher vectors encode the inter-frame information of the deep temporal and spatial features; then, vectors of locally aggregated descriptors (VLAD) encode the intra-frame information of these encoded features to obtain a global representation of the video. Experimental results show that this fusion method learns the intrinsic relationship between deep temporal and spatial features more effectively.

3. Existing deep metric learning methods do not take into account the differences in semantic distance among negative samples, assigning them equal importance and the same margin, so the characteristics of hard negative samples are learned insufficiently. We propose an adaptive-margin deep metric learning method based on a margin distribution function. First, we compute the Euclidean distance between features as the semantic distance between samples; second, we design a margin distribution function that dynamically allocates margins according to these semantic distances; finally, by computing the loss and back-propagating, the network learns the differences in sample semantic distance, automatically focuses on hard negative samples, and learns their characteristics more fully, improving its decision ability. Experiments show that the proposed method improves classification precision compared with state-of-the-art methods.
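The multi-level, multi-granularity idea of contribution 1 can be sketched in NumPy. This is a minimal illustration, not the thesis implementation: it uses a single shared Simple Recurrent Unit (the thesis cascades SRUs per feature level), random weights in place of a trained CNN, and a hypothetical pyramid of 1/2/4 segments per level.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SRU:
    """Minimal Simple Recurrent Unit with input dim == hidden dim."""
    def __init__(self, dim, rng):
        # Projections for the candidate, forget gate, and reset gate.
        self.W = rng.standard_normal((3, dim, dim)) * 0.1

    def run(self, X):
        """X: (T, dim) frame-feature sequence; returns the last hidden state."""
        c = np.zeros(X.shape[1])
        for x in X:
            xc, xf, xr = (W @ x for W in self.W)
            f = sigmoid(xf)
            r = sigmoid(xr)
            c = f * c + (1 - f) * xc           # internal state
            h = r * np.tanh(c) + (1 - r) * x   # highway connection
        return h

def temporal_pyramid(X, sru, levels=(1, 2, 4)):
    """Split the sequence into 1, 2, and 4 segments (one granularity per
    level), run the SRU on each segment, and concatenate the states."""
    parts = []
    for k in levels:
        for seg in np.array_split(X, k):
            parts.append(sru.run(seg))
    return np.concatenate(parts)

dim, T = 8, 16
# Hypothetical low-, mid-, and high-level CNN frame features for one video.
features = {lvl: rng.standard_normal((T, dim)) for lvl in ("low", "mid", "high")}
sru = SRU(dim, rng)
# Fuse the three temporal pyramids into one multi-level, multi-granularity
# video representation: 3 levels x (1 + 2 + 4) segments x dim.
video_repr = np.concatenate([temporal_pyramid(F, sru) for F in features.values()])
```

The segment splits at each pyramid level give coarse-to-fine temporal granularity, while concatenating across CNN levels supplies the multi-level part of the representation.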
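The cascaded two-level encoding of contribution 2 can likewise be sketched. This is an assumed simplification: the Fisher vector below keeps only first-order (mean) derivatives under a toy diagonal GMM, the segment count (4) and the codebook sizes are illustrative, and all GMM/codebook parameters are random where a real system would fit them on training data.

```python
import numpy as np

rng = np.random.default_rng(1)

def fisher_vector(X, means, sigmas, priors):
    """First-order Fisher vector of descriptors X (N, D) under a diagonal
    GMM: gradient w.r.t. the component means only."""
    diff = X[:, None, :] - means[None]                        # (N, K, D)
    log_p = (-0.5 * np.sum((diff / sigmas) ** 2, axis=2)
             - np.sum(np.log(sigmas), axis=1) + np.log(priors))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                 # soft assignments
    fv = (gamma[:, :, None] * diff / sigmas).sum(axis=0)      # (K, D)
    fv /= (X.shape[0] * np.sqrt(priors))[:, None]
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                    # power normalisation
    return (fv / (np.linalg.norm(fv) + 1e-12)).ravel()

def vlad(X, centers):
    """VLAD: hard-assign each row of X to its nearest center and
    accumulate residuals, then power- and L2-normalise."""
    d = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2)
    assign = d.argmin(axis=1)
    V = np.zeros_like(centers)
    for i, k in enumerate(assign):
        V[k] += X[i] - centers[k]
    V = np.sign(V) * np.sqrt(np.abs(V))
    return (V / (np.linalg.norm(V) + 1e-12)).ravel()

T, D, K = 20, 6, 3
# Hypothetical deep temporal and spatial frame features for one video.
temporal_feat = rng.standard_normal((T, D))
spatial_feat = rng.standard_normal((T, D))
# Toy GMM parameters (in practice fitted on training descriptors).
means, sigmas = rng.standard_normal((K, D)), np.ones((K, D))
priors = np.full(K, 1.0 / K)

# Level 1: Fisher-vector encoding of each segment's inter-frame information.
segs = np.stack([fisher_vector(s, means, sigmas, priors)
                 for F in (temporal_feat, spatial_feat)
                 for s in np.array_split(F, 4)])              # (8, K*D)
# Level 2: VLAD over the segment-level Fisher vectors -> global representation.
centers = rng.standard_normal((2, K * D))
global_repr = vlad(segs, centers)
```

The cascade is the point: the first encoder summarises variation across frames within a segment, and the second aggregates those summaries into one fixed-length vector per video.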
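Contribution 3 can be illustrated with a triplet-style loss whose margin depends on the negative's semantic distance. The exponential form of the margin distribution function below is an assumption for illustration (the abstract does not specify the function), as are the `m_max` and `tau` parameters; the key property it demonstrates is that hard negatives (small Euclidean distance to the anchor) receive a larger margin.

```python
import numpy as np

def adaptive_margin(d_neg, m_max=1.0, tau=1.0):
    """Assumed margin distribution function: semantically close (hard)
    negatives get a larger margin, distant (easy) ones a smaller margin."""
    return m_max * np.exp(-d_neg / tau)

def adaptive_triplet_loss(anchor, positive, negative):
    """Triplet loss with a margin allocated from the semantic distance
    (Euclidean distance between deep features)."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)   # semantic distance
    m = adaptive_margin(d_an)
    return max(0.0, d_ap - d_an + m), m

# Toy 2-D embeddings standing in for deep video features.
anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])
hard_neg = np.array([0.3, 0.0])   # semantically close to the anchor
easy_neg = np.array([3.0, 0.0])   # semantically far from the anchor

loss_hard, m_hard = adaptive_triplet_loss(anchor, positive, hard_neg)
loss_easy, m_easy = adaptive_triplet_loss(anchor, positive, easy_neg)
```

Because the hard negative draws a larger margin, it produces a non-zero loss and therefore a gradient, while the easy negative is already beyond its (smaller) margin and contributes nothing; this is how the network is made to focus on hard negatives automatically.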
Keywords/Search Tags: Video Classification, Deep Learning, Multi-level and Multi-granularity Feature, Cascaded Encoding Feature Fusion, Adaptive Margin