
Image Captioning Based On Generative Adversarial Network With Temporal Attention

Posted on: 2022-05-13
Degree: Master
Type: Thesis
Country: China
Candidate: H L Duan
Full Text: PDF
GTID: 2558307109469524
Subject: Software engineering

Abstract/Summary:
Image captioning is a technology in which computer models automatically generate natural language captions for given images. In recent years, image captioning has been widely applied in fields such as image review and smart medicine, attracting growing attention from scholars in related fields. In particular, the introduction of reinforcement learning and attention mechanisms has strongly promoted the development of image captioning technology. Images carry rich visual representations, and effective, rich feature information plays a vital role in image captioning tasks. However, most related methods do not make good use of image feature information, and errors also accumulate during sentence generation. In view of the above problems, the main research work of this thesis is as follows:

To exploit the rich information contained in image features to a greater extent, this thesis constructs a multi-attention mechanism that achieves more effective feature representation and reasoning in image captioning by using both local and global features. On the basis of this mechanism, a multi-attention generative adversarial image captioning network (MAGAN) is built, comprising a multi-attention generator and a multi-attention discriminator. The generator is used to produce more accurate sentences, while the discriminator judges whether a sentence was written by a human or generated by a machine. This method obtains highly competitive results on the official dataset.

To strengthen the correlation between the attention results and the hidden state at different time steps, and to alleviate the accumulated-error problem in word generation to a certain extent, this thesis proposes a temporal attention network (TAN) that extends the traditional attention mechanism within the commonly used encoder-decoder framework. TAN first attends to the hidden state and feature vector at the current time step, and then introduces the attention results of two adjacent LSTM steps into the network loop at the next time step through an "attention fusion slot" (AFS), enhancing the correlation between the attention results and the hidden state. In addition, this thesis designs a "hidden state switch" (HSS) to guide word generation; combining it with the AFS alleviates the accumulated-error problem to a certain extent. Extensive experiments on the official Microsoft COCO dataset show that the proposed model has clear advantages over the baseline models.

To further mitigate accumulated errors while making better use of rich image feature information to improve sentence diversity, this thesis first introduces the temporal attention mechanism and then proposes a generative adversarial image captioning network based on temporal attention. Quantitative and qualitative experiments verify that the generative adversarial network with temporal attention weakens the impact of accumulated errors during word generation and improves the diversity of the generated descriptions to a certain extent. The experiments show that the method has research value and promising prospects.
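As a rough illustration of the temporal-attention idea described above, the sketch below shows a single decoding step in which the attention context from the previous step is fused with the current one before updating the hidden state. This is a minimal sketch under stated assumptions: the function names, the additive attention scoring, and the convex-combination fusion rule (standing in for the thesis's "attention fusion slot") are illustrative choices, not the thesis's actual formulation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over attention scores
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(features, hidden, W_f, W_h):
    # additive attention: score each image region against the hidden state
    scores = np.tanh(features @ W_f + hidden @ W_h) @ np.ones(W_f.shape[1])
    alpha = softmax(scores)          # weights over regions, sums to 1
    return alpha @ features          # weighted context vector

def decode_step(features, h_prev, ctx_prev, params, beta=0.5):
    """One decoding step with temporal-attention-style fusion.

    ctx_prev is the attention result from the previous step; blending it
    with the current context carries attention information across adjacent
    steps, in the spirit of the AFS (the exact fusion rule is an assumption).
    """
    W_f, W_h = params
    ctx_t = attend(features, h_prev, W_f, W_h)
    fused = beta * ctx_t + (1.0 - beta) * ctx_prev
    # a real model would feed `fused` into an LSTM cell here;
    # a plain tanh update keeps the sketch self-contained
    h_t = np.tanh(fused + h_prev)
    return h_t, ctx_t
```

In a full captioning decoder, `h_t` would feed a vocabulary softmax to emit the next word, and a gating mechanism in the role of the HSS would decide how much of `h_t` to expose at each step.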
Keywords/Search Tags: Image captioning, Attention mechanism, Generative adversarial network, Encoder-Decoder, Reinforcement learning