Visual captioning is a fundamental problem in visual scene understanding. It aims to comprehend the objects and environment in a visual scene and to generate a text description that matches the visual content and conforms to the rules of human language. The task has significant theoretical and practical value in many fields, such as generative artificial intelligence and intelligent situation awareness. However, practical application scenes are complex: objects are densely distributed, object types are diverse, and data collection and annotation are difficult. Moreover, there are significant modality differences between vision and text. As a result, generated descriptions often suffer from poor structural integrity, insufficient content accuracy, lack of adequacy, and excessive dependence on data. Based on the above analysis, this dissertation focuses on the theory and methodology of visual captioning, with an emphasis on semantic feature encoding and decoding. It further explores semi-supervised and unsupervised visual captioning to reduce the dependence on large-scale annotated data. The main research contents and innovations are summarized as follows:

(1) To address the poor structural integrity of generated descriptions caused by the diversity of language, this dissertation proposes a part-of-speech (POS) dynamic-aware encoding captioning model. First, a POS-aware feature extraction module directionally parses complex image visual information into POS-specific visual information. Then, a POS semantic dynamic fusion encoder predicts POS labels and dynamically fuses the POS visual features according to the hidden state of the text. Finally, a POS visual feature guided captioning network generates words that match the predicted POS labels, improving the
accuracy of generated descriptions and the integrity of sentence structure.

(2) Because complex visual scenes contain a large amount of information, generated descriptions may miss details. To solve this problem, this dissertation designs a multi-level object attribute encoding captioning model. It first constructs the Crowd Caption dataset for visual captioning, which contains typical dense-scene images paired with richly detailed text descriptions. It then designs a multi-level object attribute feature encoder, which uses an attribute classification task to assist semantic feature encoding and to mine fine-grained object attribute information. Finally, through visual-text multimodal attention interaction, the model fuses the fine-grained attribute semantics and feeds the fused features into the captioning process, further strengthening the semantic association between vision and text and enriching the detail of the generated descriptions.

(3) To address the incomplete and insufficient understanding and captioning of visual scenes, this dissertation proposes an object collection decoding model for multi-perspective visual captioning. It first designs a foreground visual feature extraction module that mines both global and foreground visual features from the input image. It then builds a collection-guided decoder that discretizes coordinates, transforming the collection localization regression problem into a coordinate classification problem and thereby unifying the collection localization task with the visual captioning task. Finally, by constructing a mapping between the localization task and the captioning task, the model achieves multi-perspective visual captioning and effectively improves the sufficiency of visual scene understanding.

(4) Because the modality gap between vision and text makes accurate semantic mapping difficult, this dissertation
proposes a dual prompt-based decoding model with scene and object assistance for the visual captioning task. It first constructs a preset object space and uses the matching ability of vision-language pre-training models to build object prompt information. It then proposes a multi-scale dual-prompt prior-knowledge extraction module that extracts semantic information at different scales while predicting the scene category, thereby obtaining scene and object prompts as prior knowledge. Finally, a dual-prompt-assisted decoder establishes an explicit connection between vision and text, narrowing the gap between the two modalities, improving the accuracy of semantic mapping, and achieving accurate visual captioning.

(5) In response to insufficient model learning caused by data scarcity, this dissertation studies semi-supervised visual captioning based on trident pseudo-label generation. It constructs a triplet semantic relationship among vision, the source domain, and the target domain, and designs a target-domain style guided decoding method to construct target-domain pseudo labels for a large amount of source-domain data. It then designs a semi-supervised pseudo-label filter that applies a series of filtering rules to obtain high-quality pseudo labels for data expansion in the target domain. Finally, with the high-quality pseudo labels and a two-stage training strategy, the model's dependence on annotated data is effectively reduced and its captioning ability in the target domain is improved.

(6) In data-scarce scenes, visual information is sometimes unavailable during the training stage, which leads to poor semantic consistency between training and inference. To overcome this problem, this dissertation proposes an unsupervised visual captioning method based on visual semantic reproduction and enhancement. It breaks through the limitations of
traditional unsupervised captioning frameworks and establishes a semantic mapping from text to vision during the training stage. It then constructs a common text semantic space and mines neighboring semantic features as auxiliary information, enhancing the representation of the input data. This effectively improves the representation ability of the encoded features while achieving semantic alignment between training and inference. Finally, random masking improves the robustness of the decoding model, yielding accurate unsupervised visual captioning.
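The coordinate-discretization idea in contribution (3), in which box localization is recast as classification over a fixed coordinate vocabulary so that a text decoder can emit locations as ordinary tokens, can be sketched as follows. This is a minimal illustration under assumed details; the bin count NUM_BINS and the helper names are hypothetical and are not taken from the dissertation.

```python
# Illustrative sketch (not the dissertation's implementation): quantize
# normalized coordinates in [0, 1] into NUM_BINS discrete bins, so that
# predicting a coordinate becomes a classification problem over bin indices,
# the same form as predicting a caption word from a vocabulary.
NUM_BINS = 500  # assumed size of the coordinate-token vocabulary

def coord_to_token(coord: float, num_bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate in [0, 1] to a discrete bin index."""
    coord = min(max(coord, 0.0), 1.0)  # clamp to the valid range
    return min(int(coord * num_bins), num_bins - 1)

def token_to_coord(token: int, num_bins: int = NUM_BINS) -> float:
    """Map a bin index back to the coordinate at the bin center."""
    return (token + 0.5) / num_bins

# A normalized box (x1, y1, x2, y2) becomes a short token sequence that a
# caption decoder can emit, and the sequence decodes back to approximate
# coordinates with error bounded by the bin width.
box = (0.12, 0.30, 0.55, 0.81)
tokens = [coord_to_token(c) for c in box]
recovered = [token_to_coord(t) for t in tokens]
```

With such a mapping, the localization loss reduces to the same cross-entropy used for caption words, which is one way a single decoder can unify the collection localization task and the captioning task.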