
Non-local Image Caption Generation Based On Introspection Sequence Training

Posted on: 2024-01-02 | Degree: Master | Type: Thesis
Country: China | Candidate: X Wang | Full Text: PDF
GTID: 2568306926468184 | Subject: Engineering
Abstract/Summary:
In recent years, with the rapid development of deep learning algorithms, researchers are no longer satisfied with processing single-modality signals, and problems involving multiple modalities have attracted wide attention. Image caption generation, lying at the intersection of CV (Computer Vision) and NLP (Natural Language Processing), has been studied extensively. Solving this problem requires a complete model that accounts for image features, language features, and the relationship between the two, so image caption generation has long been regarded as a difficult problem in deep learning.

Over the past decade or so, researchers have proposed various methods for this problem, the most mainstream being the encoder-decoder model. When describing the content of a simple image, such a model can already accurately identify the objects in the image and infer the actions of the main targets from the scene content. Compared with the accuracy and detail of human descriptions, however, the model still has two shortcomings. (1) When describing complex images containing multiple objects, the generated sentences tend to be biased: when the image contains multiple objects of different categories, the model ignores the objects occupying smaller areas, and when multiple identical objects are present, they are glossed over with a bare plural form, where the lack of a specific quantity can cause misunderstanding. (2) During training, the model learns the degree of matching between adjacent words; when generating a description, it selects the word with the highest matching score conditioned on the words already generated. This yields sentences that are superficially similar to the given reference captions but differ substantially from them in content, so the model scores poorly on the SPICE metric.

To solve these two problems, this thesis proposes to optimize the traditional image caption generation model with the idea of non-local image feature extraction. Built mainly on the encoder-decoder model, the proposed model embeds a non-local feature extraction module that allows it to extract more detailed features from the image than the original model could, helping it generate more detailed and complete sentences; compared with the original model, it significantly improves the relevant evaluation metrics.

On this basis, the thesis further applies a deep reinforcement learning algorithm to optimize the training process for the caption generation task, so that the generated sentences perform well in both normativity and accuracy. To this end, it designs an introspection sequence training method: the average SPICE score of the reference sentences learned during training serves as the baseline, and the SPICE score at the next step is compared against it. If the new score is higher, it is rewarded, survives, and replaces the original baseline for the next round of training; if it is lower, it is eliminated. The feasibility of this optimization algorithm is verified through comparative experiments with different models on the COCO dataset. The improved model raises the scores of all evaluation metrics: CIDEr improves the most, from 91.7 to 102.7, followed by SPICE, which increases from 15.7 to 19.0.
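The thesis does not spell out the internals of its embedded non-local feature extraction module, but a minimal sketch in the spirit of the standard non-local block (Wang et al., 2018) illustrates the idea: every spatial position attends to every other position, so small objects and object counts are not drowned out by locally dominant regions. The class name and channel sizes here are illustrative, not taken from the thesis.

```python
# Sketch of a non-local feature extraction block, assuming PyTorch
# and 2D convolutional feature maps of shape (batch, channels, h, w).
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2                                 # reduced inner dimension
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)  # query projection
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)    # key projection
        self.g = nn.Conv2d(channels, inter, kernel_size=1)      # value projection
        self.out = nn.Conv2d(inter, channels, kernel_size=1)    # restore channel count

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, inter)
        k = self.phi(x).flatten(2)                     # (b, inter, hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, inter)
        attn = torch.softmax(q @ k, dim=-1)            # affinity of each position to all others
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual: keep the original features
```

Embedding such a block between encoder stages leaves the encoder-decoder architecture unchanged, which matches the abstract's claim that the module is added "on the original basis."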
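The introspection sequence training procedure, as described above, resembles self-critical sequence training with a running SPICE baseline. The following is a hedged sketch of one training step under that reading; `model.sample` and `spice_score` are hypothetical helpers standing in for the thesis's actual components, not a confirmed API.

```python
# Sketch of one introspection sequence training step (assumptions:
# model.sample returns captions plus per-token log-probs of shape
# (batch, seq_len); spice_score returns per-image scores, shape (batch,)).
def introspection_training_step(model, images, references, optimizer, baseline):
    sampled_caps, log_probs = model.sample(images)   # stochastic rollout
    scores = spice_score(sampled_caps, references)   # SPICE of each sampled caption

    # Reward each sample by its improvement over the running baseline.
    advantage = (scores - baseline).detach()
    loss = -(advantage * log_probs.sum(dim=-1)).mean()  # REINFORCE-style objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # A batch that beats the baseline "survives" and replaces it for the
    # next round; a lower-scoring batch is eliminated, baseline unchanged.
    return max(baseline, scores.mean().item())
```

Because the advantage is detached, only the log-probabilities carry gradient, so sampled captions scoring above the baseline are reinforced and those below it are suppressed, directly optimizing the non-differentiable SPICE metric.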
Keywords/Search Tags: Image caption generation, Non-local image feature extraction, Reinforcement learning, Reward mechanism, Evaluation metrics