Research on Fine-Grained Image Captioning Method Based on Deep Learning

Posted on: 2024-09-15
Degree: Master
Type: Thesis
Country: China
Candidate: J W Liu
GTID: 2568307067493084
Subject: Computer Science and Technology
Abstract/Summary:
Image captioning is an important research direction in multimodal learning. Its goal is to generate accurate and fluent natural language descriptions of images: the model must extract the entity categories, their attributes, and the associations between entities from the image, and then describe them in sentences consistent with human language logic. With the continuous development of artificial intelligence, image captioning methods based on deep neural networks perform well and have broad application prospects in image search, automatic image annotation, intelligent driver assistance, and many other fields. To further improve the quality of generated captions, this thesis studies three aspects of the caption generation task: entity recognition accuracy, relationship authenticity, and description detail. It optimizes the main stages of the image captioning pipeline accordingly. The main work of the thesis is as follows:

(1) A method for semantic enhancement of image features by multi-modal feature alignment (MFA). To improve entity recognition accuracy, the corresponding text features are integrated into the image features, which enhances the semantics of the image features and reduces the deviation of image information during transformation. Experimental results show that the MFA algorithm effectively improves the accuracy with which the model judges entity categories in the image and reduces errors in the generated description.

(2) A method for strengthening image entity relationships by decoupling commonsense associations (DCA). To address relationship authenticity, a novel training strategy gives the model the ability to resist commonsense associations. On this basis, a more targeted feature interaction method strengthens the relationship information between entities. Experimental results demonstrate that the DCA model can detect and correct false commonsense relationships and generate more fluent descriptions.

(3) A fine-grained image description generation method guided by part-of-speech signals (PSG). To address description fineness, language logic is used as prior knowledge to guide the model to attend to both the entity category and its fine-grained attribute information. Experiments show that the algorithm generates more fine-grained descriptions on the MS-COCO dataset.
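
As a rough illustration of the idea behind MFA, integrating text features into image features, the sketch below lets each image region vector attend over text token vectors and adds the attended text context back to the region vector. This is not the thesis's implementation: the function name `fuse_text_into_image`, the mixing weight `alpha`, the scaled dot-product attention, and the toy 2-dimensional features are all assumptions chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_text_into_image(image_feats, text_feats, alpha=0.5):
    """Hypothetical fusion step (an assumption, not the thesis's MFA).

    For each image region vector, compute scaled dot-product attention
    weights over the text token vectors, build the weighted text context,
    and add it to the region vector scaled by alpha.
    """
    dim = len(image_feats[0])
    fused = []
    for region in image_feats:
        scores = [dot(region, tok) / math.sqrt(dim) for tok in text_feats]
        weights = softmax(scores)
        context = [
            sum(w * tok[d] for w, tok in zip(weights, text_feats))
            for d in range(dim)
        ]
        fused.append([r + alpha * c for r, c in zip(region, context)])
    return fused


# Toy inputs: two image regions and two text tokens in a shared 2-D space.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse_text_into_image(image_feats, text_feats)
```

In this toy setup each region draws most of its added context from the text token it is most similar to, so the fused vector stays dominated by the region's original direction. A real system would use learned projections and high-dimensional features rather than raw dot products.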
Keywords/Search Tags: image captioning, multi-modal feature alignment, decoupling commonsense associations, linguistic logic priors