With the rapid development of the Internet, video has gradually become the dominant carrier of information. Existing supervised learning methods train network models on label information and therefore require large-scale manually annotated datasets, whose annotation undoubtedly consumes considerable resources and time. Self-supervised learning can avoid this cost; in particular, contrastive learning acquires representation ability by distinguishing positive from negative samples. To further improve the representation performance of contrastive learning, this paper conducts research on feature temporality, positive and negative samples, and residual space:

First, to address the problem that complex backgrounds and insufficient temporal features in video data limit the representation effectiveness of self-supervised contrastive learning, a video complementary collaborative contrastive representation learning model is proposed. The network is first trained by instance contrastive learning in the original RGB and optical-flow spaces separately. Feature temporality is then increased, and the complementary information between different views of the same data source is used to mine additional positive samples for retraining the model. This method improves the accuracy of distinguishing positive from negative samples and thereby the video representation performance of the model.

Second, to address the problem that the singularity of pretext tasks on video data and the scarcity of hard negative samples limit the representation effectiveness of self-supervised contrastive learning, a Pretext-Contrast representation learning model based on hard negative samples is proposed. The model combines pretext-task-based representation learning with contrastive learning to further improve its spatio-temporal representation performance. In addition, a feature-level fusion method is proposed that expands the negative sample set by combining query samples with negative samples to generate hard negative samples, effectively improving the representation performance of the model.

Finally, to address the problem that insufficient motion information in the input data and the lack of temporal coherence in video features limit the representation effectiveness of self-supervised contrastive learning, a residual contrastive representation learning model based on temporal diversity is proposed. This model adds a temporal contrastive loss to increase the temporal diversity of features. In addition, a residual-frame view is introduced into the model, and strong spatial augmentation is used to further improve its spatio-temporal representation performance, yielding significant gains on video understanding tasks.
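The abstract does not specify the exact form of the instance contrastive objective; a common choice in this line of work is an InfoNCE-style loss, which treats each clip as its own class and contrasts a query embedding against one positive and many negatives. The following NumPy sketch (function names are illustrative, not the author's) shows that idea for a single query:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere, as is standard before a cosine-similarity contrast."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query: pull the positive close, push the negatives away.

    query:     (D,)  embedding of the anchor clip/view
    positive:  (D,)  embedding of another view of the same clip
    negatives: (N, D) embeddings of other clips
    """
    q = l2_normalize(query)
    pos = l2_normalize(positive)
    negs = l2_normalize(negatives, axis=1)
    # Similarity logits: positive at index 0, negatives after it.
    logits = np.concatenate([[q @ pos], negs @ q]) / temperature
    logits -= logits.max()  # numerical stability before the softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    # Cross-entropy with the positive as the correct "class".
    return -log_probs[0]
```

In the proposed model this kind of contrast would be computed separately in the RGB and optical-flow spaces, with positives mined across the two views; the sketch above only shows the single-view building block.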
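The feature-level fusion for generating hard negatives is described only as "combining query samples with negative samples". One plausible reading, sketched below under that assumption, is convex mixing in embedding space: blending a small amount of the query into existing negatives yields synthetic negatives that lie closer to the query and are therefore harder to discriminate. All names here are illustrative:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def mix_hard_negatives(query, negatives, n_mix=4, beta_max=0.5, seed=0):
    """Synthesize hard negatives by mixing the query into sampled negatives.

    Each synthetic sample is beta * query + (1 - beta) * negative with
    beta < beta_max, so it remains a negative but moves toward the query,
    expanding the negative set with harder examples.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(negatives), size=n_mix)
    betas = rng.uniform(0.0, beta_max, size=(n_mix, 1))
    mixed = betas * query + (1.0 - betas) * negatives[idx]
    return l2_normalize(mixed, axis=1)
```

The mixed features can simply be appended to the negative set before computing the contrastive loss, which is how they "expand the negative sample set" in the abstract's terms.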
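The residual-frame view mentioned in the final contribution has a standard construction: subtracting consecutive frames, which cancels static background and keeps mostly motion. A minimal sketch of that preprocessing step:

```python
import numpy as np

def residual_frames(clip):
    """Convert a video clip (T, H, W, C) into T-1 residual frames.

    Each residual frame is the difference between consecutive frames;
    static background regions cancel to zero, so the result emphasizes
    motion information that plain RGB input under-represents.
    """
    clip = clip.astype(np.float32)
    return clip[1:] - clip[:-1]
```

Feeding such residual clips (together with strong spatial augmentation) to the encoder is one way to supply the motion information whose absence the abstract identifies in raw RGB input.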