
Research On Thangka Image Description Generation Method Based On Deep Learning

Posted on: 2024-09-08    Degree: Master    Type: Thesis
Country: China    Candidate: C Y Le    Full Text: PDF
GTID: 2555307055998039    Subject: Computer technology
Abstract/Summary:
As an important carrier of Tibetan culture, Thangka is a precious piece of intangible cultural heritage in Chinese art. Its subject matter spans history and culture, Tibetan medicine, folklore, astronomy, and calendrical computation, and it serves as a window into traditional Tibetan culture and customs. Generating textual descriptions of Thangka images helps people understand the semantic knowledge hidden in Thangka and learn about traditional Tibetan culture; it is therefore of great significance for developing the ethnic culture industry, strengthening cultural confidence, and growing the ethnic culture economy. In recent years, with the rapid progress of computer science and technology, image processing has been widely applied to the protection and inheritance of Thangka culture. This paper takes Thangka images as its research object. Based on deep learning methods, the main research work and innovations include the following:

(1) Construction of a Thangka dataset. Thangka images were collected from school archives, museums, and the Internet, expanded through data augmentation, and finally partitioned and annotated to form the dataset required for the experiments.

(2) Research on Transformer-based Thangka image description generation. Existing image captioning methods lack global representation capabilities and therefore generalize poorly to complex scenes. To address this, we experiment with Transformer-based Thangka image captioning. Specifically, a Vision Transformer (ViT) is adopted for image representation, capturing global content through multi-head self-attention layers. A Transformer decoder then converts the image features into sentences that fit the image: it explicitly models previously generated words and uses a cross-attention layer to interact with the image features. Experimental results on the Thangka dataset show that this method outperforms current image captioning methods.

(3) Research on Thangka description generation combining multi-scale and multi-level aggregation. The semantic objects in Thangka images are numerous, appear at different scales, and follow certain spatial distribution patterns, while Transformer-based encoding layers easily lose key image information. This paper therefore proposes a Thangka description generation method based on Multi-scale and Multi-level Aggregation (MMA). In the encoding stage, asymmetric convolution improves the convolution layers' ability to capture spatial information, and a pyramid pooling module further fuses the global and local multi-scale context of the Thangka image, yielding feature representations rich in semantic information. In the decoding stage, a multi-level aggregation network aggregates features from different encoding layers, improving the utilization of semantic information from high-level layers and content information from low-level layers, effectively mitigating information loss. Experimental results on the Thangka dataset show that, compared with the baseline model, the proposed method improves BLEU-1, BLEU-4, CIDEr, and METEOR by 14.4%, 15.7%, 123.6%, and 8.3%, respectively. The proposed method better extracts the key information of Thangka images and generates more accurate description sentences.
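The cross-attention mechanism described in contribution (2), where decoder word states attend over ViT patch features, can be sketched as follows. This is a minimal single-head NumPy illustration, not the thesis's actual model: the projection matrices, dimensions, and the 14x14 patch grid are illustrative assumptions, and a real model would learn the projections during training.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    """Single-head cross-attention: decoder word states (queries)
    attend over encoder image patch features (keys_values).
    Projections are random placeholders for a learned Wq/Wk/Wv."""
    rng = np.random.default_rng(0)
    Wq = rng.standard_normal((queries.shape[-1], d_k)) / np.sqrt(d_k)
    Wk = rng.standard_normal((keys_values.shape[-1], d_k)) / np.sqrt(d_k)
    Wv = rng.standard_normal((keys_values.shape[-1], d_k)) / np.sqrt(d_k)
    Q, K, V = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n_words, n_patches)
    return attn @ V, attn                   # context vectors, weights

words = np.random.default_rng(1).standard_normal((5, 64))      # 5 decoded words
patches = np.random.default_rng(2).standard_normal((196, 64))  # 14x14 ViT patches
ctx, attn = cross_attention(words, patches, d_k=32)
print(ctx.shape)   # (5, 32): one image-conditioned context vector per word
```

Each row of `attn` is a distribution over the 196 image patches, which is what lets the decoder ground each generated word in a specific region of the Thangka image.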
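The pyramid pooling step in contribution (3), which fuses global and local multi-scale context, can be sketched roughly as below. This is an assumed PSPNet-style formulation in NumPy with nearest-neighbour upsampling; the bin sizes, channel count, and fusion-by-concatenation are illustrative guesses at the module's structure, not the thesis's exact design.

```python
import numpy as np

def pyramid_pooling(feat, bin_sizes=(1, 2, 4)):
    """Pool a (C, H, W) feature map into several grid resolutions,
    upsample each grid back to H x W, and concatenate all maps along
    the channel axis so each location sees multi-scale context.
    Assumes each bin size divides H and W evenly."""
    C, H, W = feat.shape
    pooled_maps = [feat]
    for b in bin_sizes:
        # average-pool into a b x b grid (global context when b == 1)
        grid = feat.reshape(C, b, H // b, b, W // b).mean(axis=(2, 4))
        # nearest-neighbour upsample back to the original resolution
        up = np.repeat(np.repeat(grid, H // b, axis=1), W // b, axis=2)
        pooled_maps.append(up)
    return np.concatenate(pooled_maps, axis=0)

feat = np.random.default_rng(0).standard_normal((8, 16, 16))
fused = pyramid_pooling(feat)
print(fused.shape)  # (32, 16, 16): original 8 channels + 3 pooled scales
```

The `b == 1` branch captures a single global summary of the whole image, while larger bins preserve coarse spatial layout, which matches the abstract's goal of combining global and local context for Thangka scenes with objects at many scales.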
Keywords/Search Tags: Thangka image, Image description, Transformer, Multi-scale feature