| Image paragraph captioning aims to generate multiple descriptive sentences for an image. Compared with image captioning, which describes an image in a single sentence, image paragraph captioning must cover the full content of an image and produce fine-grained descriptions, so it needs to model more visual and linguistic detail. Much strong work on image paragraph captioning has appeared recently. Previous methods usually extract the objects in an image with an object detector and represent them as regional features, then rely on the language decoder to learn the relations among objects implicitly when generating a paragraph. However, these methods do not explicitly model fine-grained relations among objects, and their subsequent processing is not targeted at those relations. As a result, relations among objects are neither effectively captured nor fully utilized. To address these problems, we make the following contributions. First, we propose a novel model, DualRel, based on relation encoding and attention mechanisms. Built on the encoder-decoder framework, DualRel consists of a relation encoding module that encodes the relations among objects and a relation-aware interaction decoding module with attention mechanisms. Specifically, the relation encoding module is composed of a spatial relation encoder and a semantic relation encoder. The spatial relation encoder captures spatial relations between overlapping objects. The semantic relation encoder captures semantic relations among objects and is trained in a weakly supervised manner with the help of prior knowledge. The relation-aware interaction decoding module is composed of hierarchical attention and fusion gates, which fuse the regional features of objects with the relations among objects (both spatial and semantic) during decoding to generate paragraphs. Second, we conduct extensive experiments on a public dataset. The results show that the DualRel model significantly improves the quality of generated paragraphs, which verifies its effectiveness. Specifically, we first evaluate DualRel quantitatively using popular evaluation metrics. We then present ablation studies and visualizations for each component of the relation encoding module and the relation-aware interaction decoding module. Finally, we conduct a human evaluation of the generated paragraphs. Third, to address the lack of systems in the field of paragraph captioning and the research bottleneck, we design and implement an image paragraph captioning system based on the DualRel model. The system supports online selection of different models, generates descriptive paragraphs for uploaded images, and returns the results for display. Evaluations of the system show that it can generate vivid and detailed descriptive paragraphs for an image and can satisfy the main needs of ordinary users and researchers. |
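
The abstract states that the spatial relation encoder captures spatial relations between overlapping objects, but it does not say how overlap is determined. As a minimal sketch, assuming overlap is decided by the intersection over union (IoU) of detector bounding boxes, candidate object pairs could be selected as shown here. The `Box` class, the `iou` and `overlapping_pairs` functions, and the `min_iou` threshold are hypothetical names chosen for illustration and are not taken from DualRel.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box with (x1, y1) top-left and (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        # Clamp to zero so degenerate boxes never yield a negative area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = min(a.x2, b.x2) - max(a.x1, b.x1)
    inter_h = min(a.y2, b.y2) - max(a.y1, b.y1)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    return inter / (a.area + b.area - inter)


def overlapping_pairs(boxes, min_iou=0.0):
    """Return (i, j, iou) for every pair i < j whose IoU exceeds min_iou.

    Assumption: only such pairs would be passed on to a spatial relation encoder.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(boxes), 2):
        score = iou(a, b)
        if score > min_iou:
            pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    regions = [Box(0, 0, 2, 2), Box(1, 1, 3, 3), Box(10, 10, 11, 11)]
    for i, j, score in overlapping_pairs(regions):
        print(f"objects {i} and {j} overlap with IoU {score:.3f}")
```

Restricting relation encoding to overlapping pairs keeps the number of candidate relations well below the quadratic count of all object pairs. The threshold that best suits a given detector is left open here.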