
Image Captioning Based On Deep Learning And Multi-Metric Reinforcement Learning

Posted on: 2021-10-10
Degree: Master
Type: Thesis
Country: China
Candidate: Q Q Li
GTID: 2518306050966359
Subject: Signal and Information Processing
Abstract/Summary:
Image captioning is an important research topic in computer vision, natural language processing, and artificial intelligence. It has received widespread attention in emerging applications such as human-computer interaction, assistance for the visually impaired, intelligent security warning, and image-based social entertainment. Image captioning aims to understand and extract the semantic information in natural images and to describe it in language that is as accurate, fluent, and vivid as a human description. However, because image scenes are rich, objects are diverse, and the relationships between targets are complex, effectively perceiving scenes, correctly identifying content, correctly describing target relationships, and generating accurate, fluent, and elegant descriptions all remain major challenges for the image captioning task.

This thesis addresses three problems in turn: the lack of joint context information in the attention mechanism, the single, forward-only network framework, and learning strategies that lack multi-metric guidance. It analyzes the image captioning mechanism and uses neural networks and extracted deep features to construct a co-attention network, a feature reconstruction network, and a multi-metric reinforcement learning method, which together improve the accuracy and naturalness of the generated captions. The main research contents and results are as follows:

(1) An image captioning method based on the co-attention mechanism is proposed. First, in the encoder, image features are extracted with the ResNet-101 network and the Faster R-CNN network respectively. Second, these features are combined with the word embedding vectors of the ground-truth sentences in the training set and fed into a two-layer attention network made up of the attention LSTM and the adjacent-step co-attention model. Finally, the output of the attention network is passed through the language LSTM and a Softmax layer to generate the final description sentence. The experimental results show that this method improves the accuracy of target recognition in the image as well as the correlation between targets.

(2) An image captioning method based on feature reconstruction is presented. First, a feature reconstruction network based on either the holistic selection mechanism or the partial selection mechanism is designed, from the global and local perspectives respectively. Second, a network architecture is built from the encoder, the attention network, the decoder, and the feature reconstructor. It considers not only the forward generation process from image to description sentence but also the reverse path from caption to image features, through the reverse feature reconstruction process. Finally, the loss function is enriched with a reconstructed-feature difference term that measures the two-way matching degree between the image and the caption. The experimental results show that this method improves the accuracy of target recognition for the image as well as the correlation between targets.

(3) An image captioning method based on multi-metric reinforcement learning is proposed. First, the image-caption level and the caption-caption level are combined to design a two-stage multi-metric reinforcement learning method. In the first training stage, the end-to-end network is pre-trained with a loss function consisting of the cross-entropy term and the reconstructed-feature difference value. In the second training stage, the cross-entropy term is replaced with a multi-metric reward function to form a new loss function for fine-tuning the network. Finally, a sensitivity analysis of the evaluation metrics and extensive experiments verify the superiority of the method in both objective and subjective evaluation.

This thesis proposes improved models and methods for three different aspects of the image captioning task: the co-attention mechanism, feature reconstruction, and the network training strategy. The three methods can also be combined. In the experiments, offline or online evaluation is performed from both quantitative and qualitative perspectives. The results show that the three methods not only greatly improve the values of the objective evaluation metrics but also make the generated descriptions subjectively more accurate.
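
The two-stage training objective described in (3) can be illustrated with a small sketch. This is not the thesis's implementation: the abstract does not give the formulas, so the mean-squared form of the reconstruction difference, the weighted-sum reward, the REINFORCE-style policy term with a baseline, the weight `lam`, and the metric names `caption_caption` and `image_caption` are all assumptions chosen for illustration.

```python
import math

def cross_entropy(gt_log_probs):
    """Mean negative log-likelihood of the ground-truth words."""
    return -sum(gt_log_probs) / len(gt_log_probs)

def reconstruction_difference(image_feat, recon_feat):
    """Assumed form: mean squared difference between the image features
    and the features reconstructed from the caption."""
    n = len(image_feat)
    return sum((a - b) ** 2 for a, b in zip(image_feat, recon_feat)) / n

def multi_metric_reward(scores, weights):
    """Assumed form: weighted sum of per-metric scores.
    The metric names used as keys are hypothetical."""
    return sum(weights[name] * value for name, value in scores.items())

def stage1_loss(gt_log_probs, image_feat, recon_feat, lam=1.0):
    """Stage 1 (pre-training): cross-entropy plus reconstruction difference."""
    return cross_entropy(gt_log_probs) + lam * reconstruction_difference(
        image_feat, recon_feat)

def stage2_loss(sample_log_probs, reward, baseline,
                image_feat, recon_feat, lam=1.0):
    """Stage 2 (fine-tuning): the cross-entropy term is replaced by a
    reward-driven term. Assumed here: REINFORCE with a baseline, applied
    to the log-probabilities of a sampled caption."""
    policy_term = -(reward - baseline) * sum(sample_log_probs)
    return policy_term + lam * reconstruction_difference(image_feat, recon_feat)

if __name__ == "__main__":
    reward = multi_metric_reward(
        {"caption_caption": 0.8, "image_caption": 0.6},
        {"caption_caption": 0.5, "image_caption": 0.5},
    )
    print(stage1_loss([math.log(0.5), math.log(0.25)], [1.0, 2.0], [1.0, 4.0], lam=0.5))
    print(stage2_loss([math.log(0.5)] * 2, reward, 0.5, [1.0, 2.0], [1.0, 4.0], lam=0.5))
```

Both stages keep the reconstruction term, so the only change between them in this sketch is whether the caption is scored by likelihood of the ground-truth words or by the reward of a sampled caption.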
Keywords/Search Tags: Image Captioning, Co-attention Mechanism, Feature Reconstruction, Multi-metric Reinforcement Learning, Deep Learning