
Research On Image Dense Captioning Based On Deep Learning

Posted on: 2021-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: W B Wu
Full Text: PDF
GTID: 2568306104963969
Subject: Engineering
Abstract/Summary:
Image dense captioning is a relatively new research task in the field of image understanding. With the advent of the 5G era, massive numbers of videos and images are generated every day, and manual annotation alone cannot effectively organize and summarize these data. Image understanding enables a computer to imitate the human eye and describe an image in natural language, and dense captioning is a further refinement of this task. This thesis studies how to generate dense caption descriptions of images effectively.

First, to enable the model to extract better features from images, this thesis proposes an image dense captioning algorithm based on a deep feature extraction network. The model consists of three parts: a feature extraction network, a region-of-interest labeling network, and a description generation network. In the feature extraction network, a deep residual network is used to extract image features, improving the quality of the feature vectors in the model and, in turn, the experimental results.

Secondly, the image dense captioning algorithm describes the different targets in the image. The goal is to densely detect visual concepts (such as objects, object parts, and their interactions) in the image and to mark every concept with a short descriptive phrase. Previous algorithms focused only on the regions of interest in the image, ignoring the links between different targets. This thesis therefore proposes an attention-based dense caption generation algorithm. The algorithm mimics the signal processing mechanism unique to human vision: humans use limited attention resources to quickly screen out high-value information from a large amount of information, a survival mechanism formed over long-term evolution. The human visual attention mechanism greatly improves 
the efficiency and accuracy of visual information processing. The core goal here is to select the information most critical to the current task from a large amount of information and to combine the two mechanisms to optimize the model.

Finally, this thesis proposes a dense caption description algorithm based on context features and global features. The algorithm designs a parallel LSTM network that combines the contextual and global information of the image when describing target features. Dense caption description aims to describe each target of interest accurately. The proposed algorithm better exploits the semantic information contained in the image, allowing the model to generate descriptions that are not limited to the target region but draw on the entire picture. The resulting language descriptions are more accurate and better match human language habits.
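The abstract does not give the exact formulation of the attention module, so the "screening out high-value information" step can only be illustrated schematically. Below is a minimal NumPy sketch of dot-product soft attention over per-region features, under the assumption that the decoder weights each region by its relevance to the current decoding state; all function names and toy dimensions are hypothetical, not taken from the thesis:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, region_features):
    """Soft attention over detected regions (illustrative only).

    query           : (d,)   current decoder state
    region_features : (n, d) one feature vector per region of interest
    Returns the attention-weighted context vector and the weights.
    """
    scores = region_features @ query      # (n,)  dot-product relevance
    weights = softmax(scores)             # (n,)  non-negative, sums to 1
    context = weights @ region_features   # (d,)  weighted sum of regions
    return context, weights

# Toy example: 3 regions with 4-dim features; the query matches region 1,
# so the attention weights peak there.
regions = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 2.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])
query = np.array([0.0, 1.0, 0.0, 0.0])
ctx, w = attend(query, regions)
```

The weighted context vector, rather than a single cropped region, is what a description generator would then condition on, which is one way the links between different targets can be retained.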
Keywords/Search Tags:Neural networks, Deep learning, Image caption, Computer vision, Image dense captioning