
Multimodal Pretraining Model With Weak Supervision And Momentum Distillation

Posted on: 2024-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: H H Zhang
Full Text: PDF
GTID: 2568307118977849
Subject: Electronic information
Abstract/Summary:
By processing data from different modalities, multimodal learning can enhance the practical value of machine learning in real-life scenarios and enable models to better adapt to and handle complex tasks and situations. For image-text data, most existing multimodal pretraining models use external object detectors to extract image region features from large-scale datasets and then pretrain these regions against the corresponding text. However, existing multimodal learning methods do not handle scenarios with insufficient training data, missing bounding-box annotations, or noisy labels. To address these issues, this thesis studies a multimodal pretraining model that combines weak supervision with momentum distillation, consisting of the following two parts:

(1) To address insufficient training data and the lack of bounding-box annotations, a Transformer-based weakly supervised multimodal pretraining model is proposed. First, a weakly supervised object localization method is introduced to obtain region features of images. Then, a Transformer-based image-text encoder framework represents the multimodal features of medical images and their diagnostic reports. Alignment between images and text is achieved through pretraining tasks such as image-text contrastive learning, image-text matching, and masked language modeling. Finally, the proposed multimodal model is decoupled and applied to medical image classification tasks on several datasets; experimental results show its effectiveness.

(2) To address noisy datasets, a multimodal pretraining model based on momentum distillation is proposed. First, a weakly supervised object localization method obtains bounding boxes of target objects, which are then used to extract region image features. Next, momentum distillation is introduced to create a teacher model with the same structure as the student model. Region image-text pairs are fed to the teacher model to generate pseudo-targets, which serve as additional supervision for the student model's pretraining tasks. Finally, the proposed algorithm is applied to multiple medical image datasets; experimental results show its effectiveness.

This thesis has 25 figures, 12 tables, and 89 references.
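The abstract names image-text contrastive learning as one of the pretraining tasks in part (1) but gives no implementation details. Below is a minimal PyTorch sketch of a standard symmetric image-text contrastive (InfoNCE) objective of that kind; the function name, the temperature value, and the convention that matched pairs share a batch index are illustrative assumptions, not taken from the thesis.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_feats, text_feats: (batch, dim) tensors from the two encoders;
    the pair at the same batch index is the positive match.
    """
    # L2-normalize so dot products become cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are positive pairs.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Aligning both directions pulls matched image-text pairs together while pushing apart the other pairs in the batch, which is what "alignment of image and text" refers to in part (1).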
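Part (2) describes momentum distillation: a teacher with the same structure as the student generates pseudo-targets that act as additional supervision. A common way to realize this (popularized by ALBEF-style models) keeps the teacher as an exponential moving average of the student and mixes the teacher's soft predictions into the loss. The sketch below follows that generic recipe; the momentum value, the mixing weight `alpha`, and all function names are assumptions for illustration, not the thesis's exact implementation.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.995):
    """Update the momentum (teacher) model as an exponential moving
    average of the student's parameters after each training step."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1 - momentum)

def distilled_loss(student_logits, teacher_logits, hard_targets, alpha=0.4):
    """Mix the hard-label loss with a soft pseudo-target loss.

    teacher_logits come from the momentum model on the same inputs and
    serve as additional, noise-tolerant supervision for the student.
    """
    hard_loss = F.cross_entropy(student_logits, hard_targets)
    # Soft pseudo-targets from the teacher; detached so no gradient
    # flows back into the momentum model.
    soft_targets = F.softmax(teacher_logits.detach(), dim=-1)
    soft_loss = -(soft_targets * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
    return (1 - alpha) * hard_loss + alpha * soft_loss
```

In use, the teacher would be initialized as `teacher = copy.deepcopy(student)` and updated with `ema_update` after every optimizer step; because the soft targets average the student's recent history, they are less sensitive to noisy image-text pairs than the hard labels alone.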
Keywords/Search Tags: multimodal learning, pretraining model, weak supervision, momentum distillation