Image captioning lies at the intersection of computer vision and natural language processing. Its task is to enable a computer to understand the visual information in an image and describe it in natural language, thereby converting images into text. Image captioning has broad application prospects in image retrieval, navigation for the blind, human-computer interaction, and other fields. In contrast to traditional retrieval-based and template-based image captioning methods, this paper analyzes the widely used encoder-decoder architecture based on deep neural networks and improves the encoder and the decoder separately. The traditional encoder-decoder architecture has two main problems. First, the encoder uses only a convolutional neural network to extract global features from the image, or an object detection method to extract object features, so the visual information contained in the extracted image features is not comprehensive. Second, the decoder uses only a single language LSTM to decode the image features, resulting in a simple input-output structure. Based on the above analysis, the main contents of this paper are as follows:

1) An image captioning method based on multiple features attention. This paper optimizes the encoder and the attention mechanism in the encoder-decoder architecture. Instead of the traditional approach of using a CNN to extract global image features or object detection to extract object features, this paper uses a graph convolutional network to extract object, attribute, and relationship features, obtaining more visual information from richer image features. A multiple features attention model is then proposed: the attention module is designed with multi-level weighting and multiple-feature weighting so that the mutual influence of the three kinds of image features is taken into account, and ablation experiments are conducted on different attention mechanisms and feature weighting schemes. The optimal multiple features attention architecture is determined by the highest evaluation scores. Finally, reinforcement learning training is applied to the optimal model to further improve its performance.
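The following is a minimal sketch, in PyTorch, of how a two-level attention of the kind described in 1) could be organized: soft attention is applied within each feature type (object, attribute, relationship), and a second weighting combines the three attended vectors. All class and parameter names here are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn


class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) attention over a set of feature vectors."""

    def __init__(self, feat_dim, hidden_dim, att_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)
        self.hidden_proj = nn.Linear(hidden_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, num_regions, feat_dim); hidden: (batch, hidden_dim)
        energy = torch.tanh(self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1))
        alpha = torch.softmax(self.score(energy), dim=1)   # weights over regions
        return (alpha * feats).sum(dim=1)                   # attended feature (batch, feat_dim)


class MultipleFeaturesAttention(nn.Module):
    """First attend within each feature type (object / attribute / relationship),
    then weight the three attended vectors against the decoder hidden state."""

    def __init__(self, feat_dim, hidden_dim, att_dim):
        super().__init__()
        self.obj_att = SoftAttention(feat_dim, hidden_dim, att_dim)
        self.attr_att = SoftAttention(feat_dim, hidden_dim, att_dim)
        self.rel_att = SoftAttention(feat_dim, hidden_dim, att_dim)
        self.type_score = nn.Linear(feat_dim + hidden_dim, 1)  # second-level weighting

    def forward(self, obj_feats, attr_feats, rel_feats, hidden):
        attended = torch.stack(
            [self.obj_att(obj_feats, hidden),
             self.attr_att(attr_feats, hidden),
             self.rel_att(rel_feats, hidden)], dim=1)           # (batch, 3, feat_dim)
        h = hidden.unsqueeze(1).expand(-1, 3, -1)
        beta = torch.softmax(self.type_score(torch.cat([attended, h], dim=-1)), dim=1)
        return (beta * attended).sum(dim=1)                     # fused context vector
```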
2) An image captioning method based on time dimension information. To decode image features into description sentences more accurately, this paper optimizes the decoder in the encoder-decoder architecture, refining the structure of the language LSTM on the basis of the Top-Down attention model. The proposed method includes a time dimension information model based on the past and the present and a time dimension information model based on the present and the future, optimizing from two directions to address the decoder's two weaknesses: its single-dimensional information and its simple input-output structure. In the model based on the past and the present, the input image information is enriched by connecting multiple language LSTMs within one time step, and the connections between model outputs are strengthened; the weighted sum of the multiple outputs serves as the final prediction, reducing error accumulation and making the predicted caption more consistent with the reference sentence. The model based on the present and the future takes into account the relevance of adjacent words in the generated sentence and predicts two consecutive words within one time step; the final output is the sum of the two predictions, with the latter output complementing the former, so the model predicts captions more accurately. After training with cross-entropy loss, reinforcement learning training is performed on the optimal model to further improve its performance. The optimal model structure is determined through ablation experiments. Comparative experiments on MSCOCO and other datasets show that the model outperforms mainstream models such as Att2in, Att2all, Up-Down, and RFNet, and that the description sentences it generates are fluent and accurate.
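Below is a minimal sketch, assuming PyTorch, of one decoding step in the spirit of the present-and-future model described in 2): two output heads predict the current word and the next word, and the current-word logits are complemented by the previous step's look-ahead prediction. The module and argument names are hypothetical and the exact combination scheme in the thesis may differ.

```python
import torch
import torch.nn as nn


class PresentFutureDecoderStep(nn.Module):
    """One decoding step that predicts the current word and the next word;
    the current-word logits are summed with the previous step's look-ahead logits."""

    def __init__(self, vocab_size, embed_dim, feat_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.present_head = nn.Linear(hidden_dim, vocab_size)  # predicts the word at step t
        self.future_head = nn.Linear(hidden_dim, vocab_size)   # look-ahead: word at step t+1

    def forward(self, prev_word, context, state, prev_future_logits=None):
        # prev_word: (batch,) previous word ids; context: (batch, feat_dim) attended image feature
        x = torch.cat([self.embed(prev_word), context], dim=-1)
        h, c = self.lstm(x, state)
        present = self.present_head(h)
        future = self.future_head(h)
        # Sum the current prediction with the previous step's look-ahead prediction, if any.
        logits = present if prev_future_logits is None else present + prev_future_logits
        return logits, future, (h, c)
```

In a full decoder this step would be unrolled over time, feeding `future` from step t back in as `prev_future_logits` at step t+1, so each word benefits from both the current state and the preceding step's forecast.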