With the development of mobile internet technology, the demand for quickly understanding video content keeps growing, and video understanding tasks have therefore received increasing attention from researchers. Among these tasks, language moment localization involves cross-modal high-level semantics: it aims to locate the specific time span of a relevant activity in a video based on a natural language description. Research on this task can mainly be divided into fully supervised and weakly supervised methods. The former requires meticulous annotation of video data, which is excessively costly; this thesis therefore focuses on the weakly supervised language moment localization task.

Previous weakly supervised models commonly adopt multiple instance learning and follow a moment-candidate-selection pipeline. However, lacking ground-truth moment annotations, these models suffer from optimization problems and are prone to falling into local minima during training, which mainly manifests as uncertainty in event temporal boundaries and incomplete semantic matching with the sentence. This thesis proposes a structure- and semantics-guided, pseudo-label-supervised localization pipeline to alleviate these problems. Specifically, the thesis first proposes an algorithm that learns a matching score curve between video frames (in this thesis, "video frames" default to superframes, i.e., series of continuous frames in the video) and the sentence query based on the video's structural information, instead of directly learning moment-sentence matching scores. This curve is used to generate pseudo-labels that supervise the localization network. Because the score curve covers the full video sequence and carries its temporal content structure, the proposed model can reduce learning uncertainty and localize moments that capture a more complete event process.

Secondly, to achieve complete semantic matching with the sentence, this thesis proposes a semantic contrastive training strategy and a semantic prediction module, which guide the model to learn from unmatched and matched video-sentence pairs respectively. In the contrastive training strategy, the thesis constructs contrastive samples containing both similar and different semantics, pushing the model to accurately distinguish semantics and achieve complete semantic matching, while the semantic prediction module achieves accurate visual-sentence alignment by restraining the activation of visual content in matched videos. Extensive experiments on the Charades-STA and ActivityNet Captions datasets achieve best or second-best results compared with state-of-the-art methods under multiple Rank n@IoU=m metrics. The code is publicly available at https://github.com/yetokun/WLML.
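To make the pseudo-label step concrete, the following is a minimal sketch of turning a frame-sentence matching score curve into a temporal pseudo-label. The thresholding rule (keep superframes above a fraction of the peak score and take the longest contiguous run) is an illustrative heuristic, not the thesis's actual algorithm; the function name and `ratio` parameter are assumptions.

```python
import numpy as np

def pseudo_label_from_curve(scores, ratio=0.5):
    """Derive a (start, end) superframe pseudo-label from a matching
    score curve: keep frames scoring at least ratio * max and return
    the longest contiguous run. Illustrative heuristic only."""
    scores = np.asarray(scores, dtype=float)
    mask = scores >= ratio * scores.max()
    best, best_len, run_start = (0, 0), 0, None
    # Append False so a run ending at the last frame is still closed.
    for i, on in enumerate(np.append(mask, False)):
        if on and run_start is None:
            run_start = i
        elif not on and run_start is not None:
            if i - run_start > best_len:
                best, best_len = (run_start, i - 1), i - run_start
            run_start = None
    return best

# Example: peak activity around superframes 2-4.
print(pseudo_label_from_curve([0.1, 0.2, 0.9, 0.8, 0.85, 0.1]))  # (2, 4)
```

Because the curve spans the whole video, such a rule naturally yields boundaries that follow the temporal structure of the event rather than isolated high-scoring candidates.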
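The contrastive training idea can be sketched as an InfoNCE-style objective: the matched video-sentence score should dominate the scores of the constructed unmatched (contrastive) samples. This is a generic formulation under assumed inputs, not the thesis's exact loss.

```python
import numpy as np

def semantic_contrastive_loss(matched_score, unmatched_scores, temperature=0.1):
    """InfoNCE-style loss: softmax over one matched score and several
    contrastive (unmatched) scores, penalizing low probability on the
    matched pair. Generic sketch, not the thesis's exact objective."""
    logits = np.array([matched_score] + list(unmatched_scores)) / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # matched pair is index 0
```

A model that scores the matched pair highest incurs a small loss; if a contrastive sample with different semantics scores higher, the loss grows, pushing the model to discriminate the semantics precisely.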
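The Rank n@IoU=m metric used in the experiments is standard in moment localization: a query counts as a hit if any of the top-n predicted moments overlaps the ground truth with temporal IoU of at least m. A minimal reference computation (helper names are mine):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def rank_n_at_iou(predictions, ground_truths, n, m):
    """Fraction of queries whose top-n predictions contain at least one
    moment with temporal IoU >= m against the ground truth."""
    hits = sum(
        any(temporal_iou(p, gt) >= m for p in preds[:n])
        for preds, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

preds = [[(2, 6), (0, 3)], [(0, 1), (5, 9)]]
gts = [(4, 8), (5, 9)]
print(rank_n_at_iou(preds, gts, n=2, m=0.5))  # 0.5
```

Reporting several (n, m) pairs, as the thesis does, shows both the ranking quality and the boundary precision of the localized moments.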