
Research On Multimodal Pre-Training Technology And Visualization Based On Lightweight Model

Posted on: 2023-05-31
Degree: Master
Type: Thesis
Country: China
Candidate: T T Liu
Full Text: PDF
GTID: 2568306914971689
Subject: Intelligent Science and Technology

Abstract/Summary:
With the development of deep learning and high-performance computing resources, pre-training models based on the attention mechanism have achieved excellent results in natural language processing and multimodal learning. However, current pre-training models require large amounts of training data and are enormous in scale, which makes them costly to train and difficult to deploy on low-resource devices. This paper therefore studies multimodal pre-training with a lightweight model and a small amount of data. The specific work is as follows:

Based on the idea of curriculum learning, a new multi-stage pre-training method is proposed. Imitating the human learning process, it increases task difficulty in stages from simple to complex, so as to make better use of different types of data and improve learning performance. The proposed Multi-stage Pretraining (MSP) method pre-trains the model in stages, using information at different granularities, from word to phrase to sentence, in both text and images. For each stage, this paper also designs new pre-training tasks suited to that stage's information granularity, so as to fully capture the knowledge contained in the limited corpus. For example, to make the model fully learn the correspondence between images and text, this paper designs Image Features Random Shuffle (IFRS), which requires the model to restore the original order of shuffled image features according to the order of the text (a sketch of this task is given below). Experimental results on multiple datasets covering visual question answering, image-text retrieval, and other downstream tasks show that the model's accuracy is comparable to that of the original large model on all downstream tasks, and even exceeds it on some datasets.

This paper further studies the visualization of the proposed multimodal pre-training model and draws some explanatory conclusions about how the model works, including: pre-training at word granularity helps the model align images with text, and pre-training at phrase granularity helps the model learn the attribute information of objects. On this basis, a visualization tool for attention distributions is built (also sketched below) to display the model's internal attention, explore both intra-modal and cross-modal attention patterns, and investigate how the pre-training model learns knowledge from language and uses that knowledge to solve downstream tasks.
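To make the IFRS task concrete, the following is a minimal PyTorch sketch. The thesis does not give implementation details, so the tensor layout, the function name ifrs_batch, and the choice of a cross-entropy loss over region positions are all assumptions, not the thesis's actual code.

```python
import torch

def ifrs_batch(image_feats):
    """Image Features Random Shuffle (IFRS), sketched: permute each
    image's region features and keep the permutation as the target
    the model must recover. All names here are illustrative.

    image_feats: (batch, num_regions, dim) region features.
    Returns the shuffled features and, for each shuffled position,
    the index of that region's original position.
    """
    batch, num_regions, dim = image_feats.shape
    # One independent random permutation per example in the batch.
    perms = torch.stack([torch.randperm(num_regions) for _ in range(batch)])
    shuffled = torch.gather(
        image_feats, 1,
        perms.unsqueeze(-1).expand(-1, -1, dim))
    # perms[b, i] is the original index of shuffled position i, so a
    # position-classification head over num_regions classes could be
    # trained with cross-entropy against perms (one assumed loss choice).
    return shuffled, perms
```

During pre-training, the shuffled features would be fed to the model together with the (unshuffled) text, so that restoring the region order forces the model to ground image regions in the text sequence.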
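Likewise, here is a minimal sketch of the kind of attention visualization the paper describes, assuming a single-stream encoder whose input sequence concatenates text tokens with image region features and whose per-layer attention weights have been stacked into one tensor; the sequence layout and all names are illustrative assumptions.

```python
import torch
import matplotlib.pyplot as plt

def plot_text_to_image_attention(attn, text_tokens, num_regions, layer, head):
    """Plot one head's cross-modal attention from text tokens to
    image regions as a heatmap.

    attn: (num_layers, num_heads, seq_len, seq_len) attention weights
          saved from a forward pass; assumes the input sequence is
          [text tokens, then image region features].
    """
    t = len(text_tokens)
    # Slice the text-to-image block out of the full attention matrix.
    weights = attn[layer, head, :t, t:t + num_regions]
    fig, ax = plt.subplots()
    ax.imshow(weights.detach().cpu().numpy(), aspect="auto")
    ax.set_yticks(range(t))
    ax.set_yticklabels(text_tokens)
    ax.set_xlabel("image region")
    ax.set_title(f"layer {layer}, head {head}")
    plt.show()
```

The same slicing idea extends to the intra-modal blocks (text-to-text and image-to-image), which is how a tool like the one described can compare single-modal and cross-modal attention distributions.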
Keywords/Search Tags: multimodal pre-training, multi-stage pre-training, pre-training task