With the popularity of camera devices, video data has grown explosively in recent years and has become one of the most important data forms in daily life. Large-scale videos bring convenience to people, but they also make it time-consuming to extract useful information and increase the burden of data storage and retrieval. It is therefore important to distill video information effectively, so as to obtain a condensed video form or other compact information modalities. Video information distillation is the basis of intelligent video analysis and has become a hot research topic in artificial intelligence. Its typical tasks include video summarization and video captioning, which aim to condense the video content into a shorter version and to generate cross-modal descriptions of video activities, respectively. In practice, large-scale videos are high-dimensional, highly complex, and highly redundant, which raises a number of challenges for video information distillation, such as long-term dependency modeling, structure information exploration, and cross-modal information fusion. To address these problems, this thesis focuses on large-scale video information distillation and proposes several novel approaches for the video summarization and video captioning tasks. The main contributions are described as follows.

1. To model the properties of a video summary, a joint learning framework of multi-property models is established. According to the characteristics of video summarization, this work develops four summary property models: importance, representativeness, diversity, and storyness. By analyzing the connections among these property models, a joint learning framework is constructed to achieve their effective fusion and mutual benefit, so that the summary properties can be measured comprehensively. Experimental results verify that this work improves summary quality in containing important objects, representing the original content, reducing redundancy, and preserving the storyline.

2. To model the temporal dependencies in long videos, a tensor-train hierarchical LSTM is developed. On one hand, the tensor-train decomposition layer reduces the number of training parameters while retaining the flexibility of the original model. On the other hand, the hierarchical LSTM captures long temporal dependencies in videos and improves the non-linear fitting ability. Experimental results demonstrate that the proposed approach alleviates the overfitting problem caused by high-dimensional video features and reduces the information loss in modeling long video sequences.
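To make the tensor-train idea above more concrete, the following is a minimal, illustrative sketch of a tensor-train factorized fully connected layer, assuming PyTorch as the framework; the class name TTLinear, the mode and rank choices, and the random initialization are illustrative assumptions rather than the thesis implementation. Such a layer replaces a large input-to-hidden weight matrix with a chain of small cores, which is how the parameter reduction mentioned in contribution 2 is typically realized.

```python
import torch
import torch.nn as nn


class TTLinear(nn.Module):
    """Illustrative fully connected layer whose weight is kept in tensor-train form.

    The (prod(in_modes) x prod(out_modes)) weight matrix is never materialized;
    only small TT cores are stored and trained, which reduces the parameters
    needed for high-dimensional video features.
    """

    def __init__(self, in_modes, out_modes, ranks):
        super().__init__()
        assert len(in_modes) == len(out_modes) == len(ranks) - 1
        assert ranks[0] == 1 and ranks[-1] == 1
        self.in_modes, self.out_modes = list(in_modes), list(out_modes)
        # one core per mode, shaped (r_k, in_k, out_k, r_{k+1})
        self.cores = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(ranks[k], in_modes[k],
                                           out_modes[k], ranks[k + 1]))
            for k in range(len(in_modes))
        ])

    def forward(self, x):                        # x: (batch, prod(in_modes))
        b = x.size(0)
        h = x.reshape(b, 1, 1, -1)               # (batch, out_so_far, rank, in_left)
        for k, core in enumerate(self.cores):
            m = self.in_modes[k]
            # split the next input mode off the remaining input dimensions
            h = h.reshape(b, h.size(1), core.size(0), m, -1)
            # contract the current rank and input mode with the core
            h = torch.einsum('bjrmt,rmnq->bjnqt', h, core)
            h = h.reshape(b, h.size(1) * h.size(2), core.size(3), -1)
        return h.reshape(b, -1)                  # (batch, prod(out_modes))


# e.g. a 2048-dim frame feature mapped to a 256-dim hidden vector,
# with modes 8*16*16 -> 4*8*8 and TT-ranks (1, 4, 4, 1)
tt = TTLinear(in_modes=(8, 16, 16), out_modes=(4, 8, 8), ranks=(1, 4, 4, 1))
out = tt(torch.randn(32, 2048))                  # -> (32, 256)
```

With these (assumed) modes and ranks the layer stores 128 + 2048 + 512 = 2,688 core parameters instead of the 2048 x 256 = 524,288 weights of a dense layer, while the mapping remains trainable end to end.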
3. To capture video structure information, a hierarchical structure-adaptive LSTM is proposed. This work designs a structure-adaptive mechanism to detect video shot boundaries, and further builds a multi-task architecture for shot segmentation and video summarization. By exploiting the video structure information, the summarization task can reduce defective and mixed shots in the key-shot set. Experiments confirm that this work jointly improves the performance of shot segmentation and video summarization, which shows the benefit of exploiting structure information for video analysis.

4. To achieve unsupervised learning of video summarization, a dual learning framework is constructed. This framework employs the video reconstruction process to reward the summarization process, and uses the reward information to guide the optimization of the summary generator. It reduces the requirement for annotated data and thus enables unsupervised learning of summarization models. Based on the dual learning framework, two summarization approaches are developed by modeling the video as a sequence and as a sequence-graph, respectively, so that both the temporal dependency among frames and the global dependency among shots are analyzed. Experimental results demonstrate that the dual learning framework provides sufficient guidance for summary generation, so that the unsupervised approaches perform comparably with supervised ones.

5. To extract visual information precisely, a tube feature is designed for video captioning. The tube feature is a spatio-temporal encoding of object trajectories in the video, where a tube is obtained by sequentially linking the detected regions of an object; this reduces the interference caused by irrelevant background. Furthermore, with the assistance of the attention mechanism, the tube feature enables the caption generator to adaptively focus on the most relevant objects, which increases the discriminativeness of the visual feature. Experiments illustrate the superiority of the tube feature over traditional global features in video captioning.

6. To fuse cross-modal information, a co-attention model is proposed. By analyzing the connections between visual and textual information, this model achieves their precise extraction and effective fusion in the video captioning task. Specifically, during caption generation, the visual attention module hierarchically encodes the most relevant frames and salient regions, the text attention module adaptively encodes the features of previously generated phrases, and a balancing gate automatically regulates the influence of the visual and textual features. Experiments demonstrate the necessity of fusing visual and textual features for the video captioning task.
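As a rough illustration of the co-attention fusion described in contribution 6, the sketch below shows one plausible form of a single decoding step in PyTorch: attention over frame features, attention over embeddings of previously generated words, and a sigmoid balancing gate that mixes the two context vectors. The module names, dimensions, and the flat (non-hierarchical) attention form are simplifying assumptions; the thesis model additionally encodes salient regions hierarchically and uses phrase-level text features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttentionFusion(nn.Module):
    """Illustrative co-attention fusion for one caption-decoding time step.

    Given the decoder hidden state, attend over frame features (visual
    attention) and over embeddings of previously generated words (text
    attention), then let a learned balancing gate decide how much of each
    context vector enters the next decoding step.
    """

    def __init__(self, hid_dim, vis_dim, txt_dim):
        super().__init__()
        self.vis_att = nn.Linear(hid_dim + vis_dim, 1)
        self.txt_att = nn.Linear(hid_dim + txt_dim, 1)
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)
        self.gate = nn.Linear(hid_dim * 3, 1)    # balancing gate

    def _attend(self, h, feats, score_layer):
        # feats: (batch, n, dim); h: (batch, hid_dim)
        h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = score_layer(torch.cat([h_exp, feats], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)        # attention weights over items
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)

    def forward(self, h, frame_feats, word_embs):
        c_vis = self.vis_proj(self._attend(h, frame_feats, self.vis_att))
        c_txt = self.txt_proj(self._attend(h, word_embs, self.txt_att))
        g = torch.sigmoid(self.gate(torch.cat([h, c_vis, c_txt], dim=-1)))
        return g * c_vis + (1 - g) * c_txt       # gated fused context vector
```

In such a design, the fused context vector would typically be concatenated with the current word embedding and fed to the decoder LSTM at each step, so the gate can shift attention between visual evidence and the linguistic context already generated.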