Research And Application Of Image Caption Method Based On GAN And GRU

Posted on: 2023-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Guo
GTID: 2568306848977419
Subject: Computer software and theory
Abstract/Summary:
The image description task aims to have a computer automatically generate descriptive sentences for images. In recent years, image description has attracted wide attention for two main reasons. On the one hand, it has broad application prospects, such as navigation for the blind, intelligent monitoring, and early childhood education. On the other hand, it is a cross-modal technology spanning computer vision and natural language processing: the computer must not only recognize the main entities in an image, their attributes, and the relationships between them, but also describe them in natural language.

Currently, the main research approach to image description is the deep-learning-based encoder-decoder (Encoder-Decoder) framework. Although improvements to the Encoder-Decoder framework are continuously proposed and the accuracy of the generated descriptions keeps improving, the model structure has shortcomings: 1. The traditional Encoder-Decoder model is trained with maximum likelihood estimation, which pushes the model to generate, with maximum probability, descriptions consistent with the ground-truth description, and so ignores the naturalness and diversity of language expression. 2. The description text generated by the traditional Encoder-Decoder model does not match the image content closely, which reduces the quality of the generated descriptions. 3. The application of traditional image description technology to traffic monitoring is limited, mainly because of the lack of traffic-monitoring datasets suitable for image description.

To address these problems, this paper proposes an image description model based on generative adversarial networks. A generative adversarial network generally consists of two parts: a generator and a discriminator. The goal of the generator is to produce descriptions as close as possible to the real descriptions in order to fool the discriminator, while the main job of the discriminator is to determine whether an input sentence is a real description or one produced by the generator. The two are trained alternately until convergence. The main research work of this paper is as follows:

(1) This paper proposes a generator network based on the Encoder-Decoder framework and introduces a new fusion attention mechanism into it, so that the decoder can better understand the content of the image. The mechanism combines the local and global features of the image and computes a fused feature vector from them, so that the decoder can generate more accurate description text. In addition, after an in-depth study of convolutional and recurrent neural networks, the ResNet101 network is chosen to improve the encoder; its residual connections effectively mitigate vanishing and exploding gradients while extracting image features efficiently. The decoder uses a Gated Recurrent Unit (GRU) to process the text sequence. The GRU has a long-term memory function and fewer parameters than an LSTM serving the same purpose, so the network structure is more concise and training is more efficient.

(2) This paper proposes a discriminator network based on the GRU. The discriminator uses a gated recurrent unit as its main encoder and takes as input the generated description, the real description, and the image feature vector. The image feature vector and the text encoding vector are fed into the fusion attention, which outputs a fused vector that is then semantically matched against the text encoding vector. Because the generator outputs discrete text, the gradient signal cannot be back-propagated to the generator during training, so this paper trains the model with a method based on reinforcement learning. This paper also proposes a language evaluator, composed of several evaluation metrics, that outputs objective evaluation scores. The outputs of the discriminator and the language evaluator are combined into a reward that guides the generator. The effectiveness of the model is verified on the public MSCOCO dataset.

(3) This paper proposes a traffic image dataset and adds an attention factor to the proposed model to enhance its sensitivity to light and color and improve its performance. The model is tested on the traffic dataset and compared with other mainstream models; the results show that it effectively improves the quality of the generated traffic description text.
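The reward-guided training described in (2) can be illustrated with a minimal sketch. The abstract does not give the reward formula or the exact reinforcement learning algorithm, so everything here is an assumption made for illustration: the weight `lam`, the function names `combined_reward` and `reinforce_step`, the baseline, and the use of a plain REINFORCE update on the logits of a single softmax word choice stand in for the thesis's actual method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_reward(disc_score, eval_score, lam=0.5):
    """Blend the discriminator score with the language-evaluator score.

    `lam` is a hypothetical mixing weight; the abstract only says the two
    outputs are combined, not how.
    """
    return lam * disc_score + (1.0 - lam) * eval_score


def reinforce_step(logits, sampled_token, reward, baseline=0.0, lr=0.1):
    """One REINFORCE ascent step on the logits of one word choice.

    For a softmax policy, d log p(a) / d logit_i = 1[i == a] - p_i, so the
    sampled word becomes more likely when reward > baseline and less likely
    otherwise. No gradient flows through the discrete sample itself.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        z + lr * advantage * ((1.0 if i == sampled_token else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


# Toy usage: a 3-word vocabulary in which the generator sampled word 1.
reward = combined_reward(disc_score=0.8, eval_score=0.6, lam=0.5)
new_logits = reinforce_step([0.0, 0.0, 0.0], sampled_token=1, reward=reward)
```

In this toy setting a positive advantage raises the probability of the sampled word and a negative one lowers it, which is how a scalar reward can steer the generator even though the sampled text is discrete.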
Keywords/Search Tags:Image Caption, Generative Adversarial Networks, Gated Recurrent Unit, Reinforcement Learning, Traffic Image