With the explosive growth of video data in recent years, traditional human-based video analysis can no longer meet real-world needs, and intelligent video analysis algorithms based on artificial intelligence, especially deep learning, have become a research hotspot in both academia and industry. As one of the key technologies, temporal action localization aims to find the start time and end time of each action of interest in a video, and it has great potential in many real-world applications such as intelligent surveillance, sports event analysis, and video summarization. However, most existing methods are fully supervised and require large amounts of carefully annotated videos for model training. This requirement limits their scalability and practicality in real-world scenarios, because constructing a large dataset with action category and temporal boundary annotations is prohibitively expensive and time-consuming. This thesis aims to reduce the annotation cost of temporal action localization and systematically studies algorithms supervised by video-level text descriptions, video-level category labels, and dataset-level action category numbers. These three kinds of supervision are progressively weaker, and the corresponding algorithms are progressively harder to design. Specifically: (1) with video-level text descriptions as supervision, current methods can hardly cope with cross-modal matching between video and text under such weak supervision, leading to large deviations in the localized action boundaries; (2) with video-level category labels as supervision, current methods struggle to localize actions completely and to handle complex background interference, leading to many false positive and false negative detections; (3) with dataset-level category numbers as supervision, it is difficult to generate high-quality pseudo-labels, resulting in large deviations in the semantic categories of the localized actions. This thesis conducts in-depth research on these key challenges; the main innovations and contributions are summarized as follows.

1. A novel local correspondence-aware network is proposed for temporal action localization supervised by video-level text descriptions. To address video-text cross-modal matching under text-description supervision, the algorithm contains two core modules: a hierarchical feature representation module and a cycle consistency modeling module. The hierarchical feature representation module rearranges video and text features into a structured matrix, which makes it easy to model fine-grained video-text local correspondence; the cycle consistency modeling module learns robust local similarities between video and text through a self-supervised loss, and thereby robust fine-grained video-text correspondence. Results on two datasets show that the proposed algorithm significantly outperforms existing methods.
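To make the cycle consistency idea concrete, a minimal sketch is given below. It assumes text tokens and video segments have already been projected into a shared feature space; the function name, tensor shapes, and temperature are illustrative assumptions rather than details from the thesis. Each text token softly attends to the video segments, the attended representation is mapped back to the text tokens, and a self-supervised loss requires the cycle to return to the token it started from; a low loss is only possible when the local video-text similarities are discriminative.

```python
# Minimal cycle-consistency sketch (illustrative; not the thesis implementation).
import torch
import torch.nn.functional as F


def cycle_consistency_loss(text_feats, video_feats, temperature=0.1):
    """text_feats: (L, d) token features; video_feats: (T, d) segment features."""
    text_feats = F.normalize(text_feats, dim=-1)
    video_feats = F.normalize(video_feats, dim=-1)

    # Forward step: each text token softly attends to its most similar video segments.
    t2v = torch.softmax(text_feats @ video_feats.t() / temperature, dim=-1)  # (L, T)
    soft_video = t2v @ video_feats                                           # (L, d)

    # Backward step: the attended video representation points back to the text tokens.
    back_logits = F.normalize(soft_video, dim=-1) @ text_feats.t() / temperature  # (L, L)

    # Self-supervised objective: the cycle should land on the token it started from.
    targets = torch.arange(text_feats.size(0), device=text_feats.device)
    return F.cross_entropy(back_logits, targets)
```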
2. A novel structure-aware network is proposed for temporal action localization supervised by video-level category labels. To address incomplete action localization under video-level category labels, the network contains two structure-aware modules: a global structure modeling module and a local structure modeling module. The global structure modeling module learns more robust video representations by modeling the relationships between video segments, which helps prevent a single action instance from being split into multiple clips during localization (a schematic sketch is given after this summary); the local structure modeling module models the temporal structure of actions by discovering their composition units, which helps the network avoid focusing only on the most salient action segments. Experimental results show that this method significantly improves the completeness of action localization.

3. A novel uncertainty-guided collaborative training algorithm is proposed for temporal action localization supervised by video-level category labels. To address complex background interference under video-level category labels, the algorithm contains two core modules: an online pseudo-label generation module and an uncertainty-aware learning module. The online pseudo-label generation module uses the teacher model as a bridge to generate segment-level foreground/background pseudo-labels, so that the RGB model and the optical flow model can learn from each other during training; the uncertainty-aware learning strategy learns the reliability of the pseudo-labels directly from the data and uses a newly designed uncertainty-aware loss to mitigate the negative effect of noisy pseudo-labels (see the sketch below). Experimental results on two datasets with three base methods show that the proposed algorithm significantly suppresses complex background interference and improves the performance of these methods.

4. A novel optimal transport algorithm is proposed for temporal action localization supervised by dataset-level action category numbers. To generate high-quality pseudo-labels under this supervision, the algorithm formulates pseudo-label generation as an optimal transport problem with three core constraints: "Consistent, Compact, and Uniform". The "Consistent" constraint keeps the semantics of each pseudo-category unchanged during training, which is essential for stable model training; the "Compact" constraint pulls video features with the same pseudo-label close to each other, which helps ensure the accuracy of the pseudo-labels; the "Uniform" constraint keeps the number of generated pseudo-labels roughly balanced across classes, which effectively prevents the pseudo-labels from collapsing into a few large classes (a sketch of such a balanced assignment is given below). Experimental results on two datasets show that the proposed algorithm significantly improves the quality of the pseudo-labels and achieves superior localization performance.
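For the global structure modeling module of contribution 2, the sketch below refines segment features with self-attention so that segments of the same action exchange information; the class name, feature dimension, and the use of multi-head attention are assumptions for illustration, not the thesis architecture.

```python
# Global structure modeling sketch (illustrative; not the thesis architecture).
import torch
import torch.nn as nn


class GlobalStructureModule(nn.Module):
    """Let every segment attend to every other segment so that segments of the
    same action instance obtain consistent representations and scores."""

    def __init__(self, dim=2048, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, segments):                      # segments: (B, T, dim)
        context, _ = self.attn(segments, segments, segments)
        return self.norm(segments + context)          # residual keeps original cues


feats = torch.randn(2, 100, 2048)                     # 2 videos, 100 segments each
refined = GlobalStructureModule()(feats)              # (2, 100, 2048)
```

The refined features then feed the usual classification head that produces the class activation sequence used for weakly supervised localization.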
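For the uncertainty-aware learning of contribution 3, the following is a minimal sketch assuming a heteroscedastic-style weighting in which the student predicts a foreground score and a per-segment log-variance; the function name, shapes, and exact weighting are assumptions and stand in for, rather than reproduce, the thesis loss.

```python
# Uncertainty-aware loss sketch (illustrative; a stand-in for the thesis loss).
import torch
import torch.nn.functional as F


def uncertainty_aware_loss(fg_logits, log_var, pseudo_labels):
    """fg_logits, log_var, pseudo_labels: (B, T) tensors; pseudo_labels are the
    0/1 foreground-background labels generated online by the teacher of the
    other stream (RGB <-> optical flow)."""
    bce = F.binary_cross_entropy_with_logits(
        fg_logits, pseudo_labels.float(), reduction="none"
    )
    precision = torch.exp(-log_var)       # learned reliability of each pseudo-label
    # Unreliable segments receive low precision and are down-weighted; the
    # log-variance term penalizes predicting high uncertainty everywhere.
    return (precision * bce + log_var).mean()
```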
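For contribution 4, a Sinkhorn-style balanced assignment is one standard way to realize the "Uniform" constraint; the sketch below is based on that assumption and is not necessarily the solver used in the thesis. It converts video-to-prototype similarities into pseudo-labels whose per-class counts are roughly equal.

```python
# Balanced pseudo-label assignment sketch (Sinkhorn-style; illustrative only).
import torch


@torch.no_grad()
def uniform_pseudo_labels(scores, n_iters=3, eps=0.05):
    """scores: (N, K) similarities between N videos and K class prototypes,
    where K is the dataset-level number of action categories."""
    q = torch.exp(scores / eps).t()               # (K, N) transport plan to normalize
    q /= q.sum()
    n_classes, n_videos = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)           # each pseudo-class gets equal mass
        q /= n_classes
        q /= q.sum(dim=0, keepdim=True)           # each video carries one unit of mass
        q /= n_videos
    return (q * n_videos).t().argmax(dim=1)       # hard pseudo-label per video


labels = uniform_pseudo_labels(torch.randn(500, 20))   # 500 videos, 20 categories
```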