Research On Fine-grained Image Captioning Method Based On Deep Learning

Posted on:2024-09-15

Degree:Master

Type:Thesis

Country:China

Candidate:J W Liu

Full Text:PDF

GTID:2568307067493084

Subject:Computer Science and Technology

Abstract/Summary:

Image captioning is an important research direction in multimodal learning. The task is to generate accurate and fluent natural language descriptions for images: that is, to extract entity categories, attributes, and the associations between entities from an image, and then describe them in sentences consistent with human language logic. With the continuous development of artificial intelligence technology, image captioning methods based on deep neural networks have achieved high performance, and they have broad application prospects in image search, automatic image annotation, intelligent driver assistance, and many other fields. To further improve the quality of generated captions, this thesis studies entity recognition accuracy, relationship authenticity, and descriptive detail in the caption generation task, and optimizes the main stages of the image captioning pipeline. The main work of the thesis is as follows:

(1) An image feature semantic enhancement method based on multi-modal feature alignment (MFA). To address entity recognition accuracy, the corresponding text features are integrated into the image features to enhance their semantics and reduce the deviation of image information during transformation. Experimental results show that the MFA algorithm effectively improves the model's accuracy in judging the categories of entities in the image and reduces errors in the generated descriptions.

(2) An image entity relationship strengthening method based on decoupling commonsense associations (DCA). To address relationship authenticity, a novel training strategy gives the model the ability to resist commonsense associations, and on this basis a more targeted feature interaction method strengthens the relationship information between entities. Experimental results demonstrate that the DCA model can detect and correct false commonsense relationships and generate more fluent descriptions.

(3) A fine-grained image description generation method guided by part-of-speech signals (PSG). To address description granularity, language logic is used as prior knowledge to guide the model to attend to both entity categories and their fine-grained attribute information. Experiments show that the algorithm generates more fine-grained descriptions on the MS-COCO dataset.

Keywords/Search Tags:

image captioning, multi-modal feature alignment, decoupling commonsense associations, linguistic logic priors

Related items

1	Research On Image Captioning Algorithm Guided By Attention And Visual Common Sense
2	Research On Visual Captioning Algorithm For “Visual-Linguistic” Cross-Modal Semantic Alignment
3	Image Captioning Theories And Methods
4	Research On Cross-Modal Image-Text Retrieval Techniques Based On Semantics And Common Sense
5	Research On Multimodal Data Modeling And Retrieval For Common Space Learning
6	Research On Multi-feature And Multi-modal Video Captioning Based On Deep Learning
7	Research On Image Captioning Models Based On Deep Learning
8	Collaborating General And Specific Semantics For Multi-feature Based Image Captioning
9	Research On Social Image Captioning Based On Deep Learning
10	Image Captioning By Multi-feature Fusion