
The Research For Cross-media Sentence Generation And Localization

Posted on: 2022-11-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M X Zhang
Full Text: PDF
GTID: 1488306764958639
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of deep learning, research in computer vision is no longer satisfied with simple annotation but faces increasingly diverse and complex tasks. This dissertation presents an in-depth study of cross-media semantic understanding between image/video and text, focusing on cross-media semantic generation and localization. Specifically, it first studies the effects of image region localization and the visual attention mechanism on image semantic understanding and description generation, and proposes an image description generation network built on both. It then revisits the image description generation task and proposes an online positive-label recall and missing-concept mining method that detects more semantic concepts and then performs semantic selection for description generation. Finally, from the perspective of cross-media semantic localization, it studies the task of video sentence localization and proposes a multi-stage aggregated Transformer network for video language localization. From static images to dynamic video, from visual content to text generation, and from text back to visual content localization, this dissertation examines the problem of cross-media semantic understanding between video and text from multiple perspectives. In detail, the main contents are as follows:

1) Semantic region localization and visual attention for image description generation. A novel end-to-end image description generation framework is proposed for high-level semantic understanding of images. Compared with previous methods, this framework is the first to combine semantic region localization with a visual attention mechanism to automatically generate image descriptions. The semantic localization and visual attention modules provide the conditions for high-level semantic understanding of images, and the semantic localization framework is also applicable to many other tasks.

2) Detecting and selecting more concepts for image description generation. This part also addresses the image description generation problem. To further improve description quality, the dissertation decouples the visual semantic detection phase from the description generation phase, and proposes first providing more semantic concepts and then applying a concept selection process to generate the description automatically. To detect more semantic concepts, the dissertation proposes the online positive recall and missing concepts mining (OPR-MCM) method; it also discusses the semantic selection process in the generation stage. With rich semantic concepts and high detection accuracy, the method generates more accurate and detailed image descriptions.

3) Multi-stage aggregated Transformer network for video language localization. This part targets high-level semantic understanding of a more complex form of image sequence, i.e., video. A multi-stage aggregated Transformer network is proposed for video language localization. The network introduces a new visual-language Transformer backbone that remains structurally concise while preserving modality specificity, supporting fine-grained visual-language alignment. In addition, a multi-stage aggregation module computes three representations for three different temporal stages and aggregates them over candidate moments, yielding more discriminative features for accurate localization. The model scales well and can be pre-trained on large amounts of video-language data.

4) Recurrent attention network using spatial-temporal relations for video feature representation. Because of the complexity of video data, this part studies the basic task of video feature representation, since high-level cross-media semantic understanding of video requires more effective feature representations. The dissertation puts forward a new attention mechanism that derives an attention unit from the standard LSTM and uses its gating system to compute attention weights; the attention units are then assembled into a 3D recurrent network. Unlike previous attention models that compute attention weights for different regions independently, this network exploits the spatio-temporal relationships between video regions to achieve more accurate attention and more accurate video feature representation, improving the accuracy of video action recognition and video sentence localization.

Finally, the dissertation summarizes the research as a whole, reflects further on follow-up work, and proposes possible future research directions.
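The region-attention idea in aspect 1 — localize image regions, then weight their features by relevance to the caption decoder's current state — can be sketched as a single soft-attention step. This is an illustrative sketch only: all names are hypothetical, and numpy stands in for whatever deep-learning framework the dissertation actually uses.

```python
import numpy as np

def soft_attention(region_feats, decoder_state):
    """Return an attention-weighted context vector over localized regions.

    region_feats : (num_regions, dim) features of localized image regions
    decoder_state: (dim,) current hidden state of the caption decoder
    """
    scores = region_feats @ decoder_state            # relevance of each region
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over regions
    context = weights @ region_feats                 # convex combination
    return context, weights

rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))   # 4 localized regions, 8-dim features each
state = rng.normal(size=8)
context, weights = soft_attention(regions, state)
```

At each decoding step the decoder state changes, so the attention distribution over regions shifts as the description is generated word by word.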
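The "provide more concepts, then select" pipeline in aspect 2 can be illustrated by a minimal selection step: a detector scores a large concept vocabulary, and only the most confident concepts are kept to condition the decoder. This does not reproduce the OPR-MCM training procedure itself; the vocabulary, scores, and thresholds below are invented for illustration.

```python
import numpy as np

def select_concepts(scores, vocab, k=3, threshold=0.2):
    """Keep the k highest-scoring detected concepts above a confidence
    threshold; the selected words then condition the caption decoder."""
    order = np.argsort(scores)[::-1]                 # highest score first
    return [vocab[i] for i in order[:k] if scores[i] >= threshold]

vocab = ["dog", "frisbee", "grass", "car", "tree"]
scores = np.array([0.92, 0.75, 0.61, 0.08, 0.30])
picked = select_concepts(scores, vocab)
print(picked)  # ['dog', 'frisbee', 'grass']
```

Decoupling detection from generation means the detector can be made as rich as possible, while the selection step keeps only concepts the generator should actually verbalize.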
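The core task in aspect 3 — given a sentence, find the video moment it describes — can be sketched as scoring every candidate span against a sentence embedding. The dissertation's actual model scores moments with a multi-stage aggregated Transformer; the exhaustive mean-pooling baseline below only shows the candidate-moment skeleton, with all features invented for the example.

```python
import numpy as np

def localize(frame_feats, sent_vec):
    """Score every candidate moment (i, j) by the cosine similarity between
    the sentence vector and the mean frame feature over frames i..j,
    returning the best-scoring span."""
    T = frame_feats.shape[0]
    best_score, best_span = -np.inf, None
    for i in range(T):
        for j in range(i, T):
            m = frame_feats[i:j + 1].mean(axis=0)
            s = m @ sent_vec / (np.linalg.norm(m) * np.linalg.norm(sent_vec))
            if s > best_score:
                best_score, best_span = s, (i, j)
    return best_span, best_score

# toy video: frames 2-4 all point along the sentence direction [1, 0],
# with region noise that cancels when the three frames are averaged
sent = np.array([1.0, 0.0])
frames = np.array([[0.0, 1.0], [0.1, 0.9],
                   [1.0, 0.2], [1.0, -0.3], [1.0, 0.1],
                   [0.0, 1.0]])
span, score = localize(frames, sent)
```

Scoring every span is O(T^2); the Transformer-based model replaces the pooled feature with stage-aware aggregated representations so candidate moments become more discriminative.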
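Aspect 4 derives an attention unit from LSTM gating so that attention is recurrent: the weight of a region at time t depends on what was attended at earlier frames, rather than being computed independently per region. The toy unit below captures only that gating-plus-recurrence idea under invented shapes and parameters; it is not the dissertation's 3D recurrent network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_step(x_t, h_prev, w_x, u, b):
    """One recurrent attention step in the spirit of LSTM gating.

    x_t   : (R, D) region features of the current frame
    h_prev: (R,)   previous per-region attention state, so each region's
            weight depends on what was attended at earlier time steps
    """
    gate = sigmoid(x_t @ w_x + u * h_prev + b)  # (R,) per-region gate
    weights = gate / gate.sum()                 # normalize into attention
    context = weights @ x_t                     # (D,) attended frame feature
    return context, weights

rng = np.random.default_rng(1)
R, D, T = 4, 8, 3                   # regions, feature dim, frames
w_x, u, b = rng.normal(size=D), 0.5, 0.0
h = np.zeros(R)                     # initial attention state
for t in range(T):                  # unroll over frames
    frame = rng.normal(size=(R, D))
    context, h = gated_attention_step(frame, h, w_x, u, b)
```

Feeding the previous step's weights back through the gate is what lets the unit model spatio-temporal relations between regions instead of scoring each region in isolation.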
Keywords/Search Tags: cross-media high-level semantic understanding, sentence generation and localization, image description generation, video language localization