Image captioning is a cross-modal transformation from the visual content of an image to natural-language text. It involves computer vision (CV), natural language processing (NLP), and related fields, and is widely used in semantic image search and multi-modal image understanding. The aesthetic image captioning (AIC) task tends to be more subjective, diverse, and artistic, and because it lacks large-scale datasets such as the MSCOCO dataset available for conventional image captioning (IC), progress on AIC has been slow. In the real world, however, a computer that can describe image content by capturing not only the key information of the image but also expressing it in a fluent and elegant form would be of great value to photography guidance, intelligent image recommendation, and dialogue systems.

To explore this problem, we first propose a new method that combines image captioning with aesthetic image captioning. The dominant deep learning approach in the image captioning domain is the Encoder-Decoder model. During training, the Encoder applies the idea of twin (Siamese) networks to receive samples from the two domains: images from both domains pass through twin CNN branches with shared parameters, and the extracted features are fused into a single feature vector that carries the information of both images. On the Decoder side, the text features of the two captions corresponding to the paired inputs are likewise combined into one feature vector for training. In this way, following the idea of machine translation, the model learns the datasets of both domains. At test time, because the twin branches share parameters, the CNN can extract both the visual features and the aesthetic features learned from the two datasets. We therefore remove one branch; after the CNN extracts the image features, we duplicate them before fusion to keep the dimensions consistent, and feed the fused vector into the Decoder.

Experimental results show that our method unifies the conventional image captioning (IC) task and the aesthetic image captioning (AIC) task and achieves good results. However, because aesthetic image captions suffer from a weak-label problem, the Encoder-Decoder model alone learns the data only to a limited degree. To improve performance, we borrow the adversarial idea of GANs and build a deep reinforcement learning model: a multi-modal discriminator and a language-style discriminator judge each generated description and produce a reward score, and a policy-gradient mechanism uses this reward to further update the Encoder-Decoder parameters, strengthening the model's learning of the data.

The problem we address is difficult and complex. It involves techniques from both the CV and NLP fields, making it a multi-modal task, and its datasets span two domains (IC and AIC), making it a cross-domain task. From theory to practice, we show that our aesthetic description method, which incorporates image captioning, produces descriptions better suited to the images.
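To make the shared-parameter twin encoder concrete, the following is a minimal PyTorch sketch. It assumes a ResNet-18 backbone and concatenation as the fusion operator; the actual backbone and fusion step may differ, and the `TwinEncoder` name, `feat_dim`, and projection layer are illustrative. Passing a single image duplicates its features, matching the test-time setup described above.

```python
# Minimal sketch of the twin (Siamese) encoder with shared parameters.
# Assumptions (not taken from the paper): ResNet-18 backbone,
# concatenation as the fusion step, and a linear projection to feat_dim.
import torch
import torch.nn as nn
import torchvision.models as models

class TwinEncoder(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Drop the classification head; keep pooled conv features (B, 512, 1, 1).
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.proj = nn.Linear(2 * 512, feat_dim)  # fuse two 512-d vectors

    def forward(self, img_ic, img_aic=None):
        f1 = self.cnn(img_ic).flatten(1)            # IC-domain features
        if img_aic is None:
            f2 = f1                                 # test time: duplicate to keep dims
        else:
            f2 = self.cnn(img_aic).flatten(1)       # AIC-domain features, same weights
        return self.proj(torch.cat([f1, f2], dim=1))
```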
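The Decoder can likewise be sketched as a standard LSTM caption generator conditioned on the fused feature vector. This is an assumed minimal form trained with teacher forcing; the vocabulary size, embedding width, and hidden size are placeholder values rather than the configuration used in our experiments.

```python
# Minimal teacher-forced caption decoder conditioned on the fused
# image feature; hyperparameters below are illustrative placeholders.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, feat_dim=512, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden)   # map fused feature to LSTM state
        self.init_c = nn.Linear(feat_dim, hidden)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, captions):
        # captions: (B, T) token ids; predict each next token from the previous ones.
        h0 = self.init_h(feats).unsqueeze(0)
        c0 = self.init_c(feats).unsqueeze(0)
        x = self.embed(captions[:, :-1])            # teacher forcing on shifted inputs
        y, _ = self.lstm(x, (h0, c0))
        return self.out(y)                          # (B, T-1, vocab) next-token logits
```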
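The reinforcement learning stage can be summarized as a REINFORCE-style policy-gradient update in which the reward combines the two discriminator scores. The sketch below is a hedged illustration: the `decoder.sample` interface (returning sampled tokens with their log-probabilities), the per-sample discriminator scores, and the weighting `alpha` are all assumed stand-ins, not the exact formulation.

```python
# REINFORCE-style policy-gradient step; the reward is an assumed weighted
# sum of the multi-modal and language-style discriminator scores.
import torch

def policy_gradient_step(decoder, optimizer, feats,
                         multimodal_disc, style_disc, alpha=0.5):
    # Sample a caption from the current policy (the decoder).
    # `decoder.sample` is a hypothetical API returning (tokens, per-step log-probs).
    tokens, log_probs = decoder.sample(feats)
    with torch.no_grad():
        r_mm = multimodal_disc(feats, tokens)       # image-text consistency score
        r_style = style_disc(tokens)                # aesthetic language-style score
        reward = alpha * r_mm + (1 - alpha) * r_style
    # REINFORCE: scale the sequence log-probability by the reward.
    loss = -(reward * log_probs.sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()
```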