Image Captioning Based On Visual Relationship And Semantic Correlation

Posted on:2023-11-23

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J Wang

Full Text:PDF

GTID:1528307331472044

Subject:Computer Science and Technology

Abstract/Summary:

Image captioning aims to generate textual descriptions that are consistent with image content and logically sound. It bridges the two modalities of vision and language, and can be used in scenarios such as conversational robots and visual aids for visually impaired people. It therefore has important research value and promising application prospects. The task, in essence, is to translate from one modality (images) to another (text) without changing the semantics. Hence, it is important to detect the high-level semantics in images and to fully explore the visual relationships and semantic correlations within the two modalities. Based on this analysis, this thesis studies visual relationship and semantic correlation from the following aspects: building spatial and geometrical relationships for semantics, improving attention correlation, distilling semantic topics, and improving the semantic correlation between vision and language. It then applies them to image captioning, image paragraphing, and storytelling. The main contributions of this thesis are summarized as follows:

(1) Spatial relationship-based image captioning. Reading and writing scene texts is a great challenge for image captioning, and often leads to repeated words and incorrectly ordered scene texts. To solve these problems, the thesis proposes to build the spatial relationship between scene texts. First, several scene texts are detected in the image, the relative angles between them are computed, and the spatial relationship is established. The relationship is then used to update the word probabilities in the inference phase, which makes scene texts easier to predict. In addition, a multi-modal attention mechanism is proposed to give equal consideration to the information from objects and from scene texts. Such a design may achieve a more thorough understanding of the image content and make the descriptions more informative.

(2) Geometrical relationship-based image captioning. The preceding work defines only a simple spatial relationship between scene texts and relies on hand-crafted rules for using it. As a result, the relationship between scene texts cannot be fully exploited, and the model capability is limited. To solve these problems, a geometrical relationship is established to enhance the correlation between scene texts. The height, width, distance, IoU, and orientation relations are comprehensively considered. To integrate the learned relationships, the proposed method capitalizes on a relation-aware pointer network. These designs can resolve incomplete or disordered scene texts, and avoid the subjectivity and one-sidedness that hand-crafted rules may introduce.

(3) Attention correlation-based image captioning. In existing methods, the attention at each time step is usually computed independently, and the connections across attentions are seldom explored. As such, the sequential evolution is not fully encoded into attention, and the results may suffer from repeated or incomplete descriptions. We propose to mitigate this issue by memorizing the attention history and capitalizing on this contextual knowledge to compute the next attention. Technically, a memory module is proposed. First, the attention context is read from the memory. Then, attention features and word probabilities are computed based on it. After word prediction, the memory is updated with gate units. Such a design encodes the sequential evolution into attention, which helps resolve incomplete or disordered descriptions. Furthermore, we endow the attention mechanism with more power by seeking a hybrid of easily trainable soft attention and more accurate but non-differentiable hard attention, which improves model performance while reducing training difficulty.

(4) Semantic topic-based image paragraphing. Building on the study of short texts in the first three works, this work goes one step further and investigates long texts. A key issue in image paragraphing is distilling the semantic topics, which requires a comprehensive analysis of the semantic concepts and correlations in images and a proper distillation of knowledge. To solve this problem, a convolutional auto-encoding module is designed for topic modeling. It first uses a convolutional encoder to encapsulate region-level features into topics, which are endowed with holistic and representative information by requiring a deconvolutional decoder to reconstruct the features with high quality. The distilled topics are further integrated into a two-level LSTM-based paragraph generator, which models inter-sentence dependency within a paragraph via the paragraph-level LSTM and performs topic-oriented sentence generation through the sentence-level LSTM.

(5) Incorporating language style and image-text consistency for storytelling. Compared to the objective descriptions learned in the above works, storytelling is even more challenging because of the difficulty of modeling an ordered photo sequence and of generating a relevant paragraph with an expressive language style. To deal with these challenges, this thesis presents a language style and image-text consistency modeling approach based on reinforcement learning and adversarial training. Specifically, a generation network is designed to take actions and create stories, and two critic networks (a multi-modal discriminator and a language-style discriminator) assess the stories. The story generator and the reward critics are treated as adversaries. The generator aims to create paragraphs that are indistinguishable from human-level stories, whereas the critics aim to distinguish them and further improve the generator through policy gradient. The combination of reinforcement learning and adversarial training encourages both relevance and story style in the generated stories.
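As an illustration of the relations named in contributions (1) and (2), the sketch below computes width, height, distance, IoU, and relative-angle features for a pair of scene-text boxes. The abstract does not give the thesis's actual formulas, so the function name `box_relations`, the `(x1, y1, x2, y2)` box format, the use of ratios for width and height, and the centre-to-centre definitions of distance and angle are all assumptions made for this example.

```python
import math

def box_relations(a, b):
    """Pairwise geometric features between two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with positive width and height.
    The feature definitions are illustrative assumptions, not the
    formulas used in the thesis.
    """
    aw, ah = a[2] - a[0], a[3] - a[1]
    bw, bh = b[2] - b[0], b[3] - b[1]

    # Offset from the centre of box a to the centre of box b.
    dx = (b[0] + b[2]) / 2 - (a[0] + a[2]) / 2
    dy = (b[1] + b[3]) / 2 - (a[1] + a[3]) / 2

    # Overlap area; zero when the boxes do not intersect.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = aw * ah + bw * bh - inter

    return {
        "width_ratio": bw / aw,
        "height_ratio": bh / ah,
        "distance": math.hypot(dx, dy),
        "iou": inter / union if union > 0 else 0.0,
        # Angle of b relative to a; in image coordinates y grows downward.
        "angle_deg": math.degrees(math.atan2(dy, dx)),
    }

# Two equally sized boxes, the second shifted one unit to the right.
relations = box_relations((0, 0, 2, 2), (1, 0, 3, 2))
```

Features of this kind could be fed to a learned relation module such as the relation-aware pointer network mentioned above; how the thesis encodes and consumes them is not stated in the abstract.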

Keywords/Search Tags:

Image Captioning, Image Paragraphing, Storytelling, Visual Relationship, Semantic Correlation

Related items

1	Image Feature Understanding And Semantic Representation Based On Deep Learning
2	Research On Semantic Attribute Based Visual Semantic Image Captioning Method
3	Research On Visual Semantic Graph Construction And Its Application In Image Captioning
4	Research On Image Captioning Algorithm Based On Deep Neural Networks
5	Neural Networks Based Image Captioning Models For Obtaining Accurate Descriptions
6	Hierarchical Visual Semantic Embedding For Image Captioning
7	Research On The Theory And Method Of Visual Captioning
8	Image Captioning Theories And Methods
9	Research On Image Captioning Algorithm Based On Deep Learning
10	Research On Image Captioning Methods Based On Deep Learning