In deep learning, how to train effective models with limited data is a fundamental problem, and the pre-training and fine-tuning paradigm is an efficient way to address it. Inspired by BERT and GPT, researchers at home and abroad have proposed a number of cross-modal pre-training and fine-tuning methods that jointly represent images and text. However, current research does not consider the robustness of cross-modal pre-training from the global perspective of the model building stage, the pre-training stage, and the fine-tuning stage, because it suffers from the following three deficiencies. First, in the model building stage, there is the problem of cross-modal fusion under representational asymmetry. Contemporaneous cross-modal pre-training studies do not achieve an end-to-end learning process: the visual input consists of regional features extracted by deep neural networks, while the textual input is not deeply characterized. Current cross-modal fusion methods therefore not only introduce shallow textual noise but also ignore cross-layer interactions of cross-modal information, resulting in poor robustness of the cross-modal model. Second, in the pre-training stage, there is the problem of sparse cross-modal visual semantic representations. Existing cross-modal pre-training models use a supervised classification task in the visual branch, which limits their ability to obtain fine-grained representations of visual semantics, so the models become less robust when facing fine-grained semantic changes in visual content. Third, in the fine-tuning stage, there is the problem of poor robustness under mixed-granularity attacks. When facing complex and diverse downstream scenarios, a cross-modal downstream model is vulnerable to perturbations of different granularities, which may come both from small perturbations in the continuous embedding space and from synonym replacements in the discrete text space; the resulting wrong predictions indicate insufficient robustness of the cross-modal model.

Accordingly, the research on robustness-oriented cross-modal pre-training in this thesis focuses on three aspects: a cross-layer cross-modal information fusion method under representational asymmetry, a cross-modal unsupervised pre-training method for dense visual semantic representation, and a cross-modal robust fine-tuning method against hybrid-granularity attacks. The thesis achieves the following innovations.

1) Aiming at the problem of cross-modal information fusion under representational asymmetry, this study proposes a cross-layer fusion method based on the quaternion inner product. In existing research, the visual input usually comes from visual region features, so visual and textual information are represented asymmetrically, and the existing Transformer mechanism fuses cross-modal information only within the same layer; current methods therefore not only introduce shallow textual noise but also ignore interactions of cross-modal information across layers, which weakens the robustness of cross-modal models. With the proposed method, this study constructs a Quaternion Block Network (QBN) that solves the problem of cross-modal information fusion in asymmetric scenarios and explicitly realizes cross-layer fusion of cross-modal information. Within the quaternion block, multilayer content learning and multilayer relationship learning not only remove shallow textual noise but also capture cross-layer interactions between the modalities, improving model robustness. Dynamically scaling visual features with textual features verifies that introducing more text-related visual features can effectively improve performance. The proposed QBN model and its sub-models are validated on the VQAv2 dataset: on the visual question answering (VQA) task, QBN surpasses contemporaneous models and even early cross-modal pre-training models, verifying the effectiveness of the quaternion-inner-product cross-layer fusion method.
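As a purely illustrative aid (not part of the thesis), the following minimal PyTorch sketch shows one way a quaternion inner product could drive cross-layer fusion with text-conditioned scaling of visual features; the class name, the gating choice, and the aggregation scheme are all hypothetical assumptions rather than the actual QBN design.

import torch
import torch.nn as nn

class QuaternionCrossLayerFusion(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        assert d_model % 4 == 0, "hidden size must split into 4 quaternion parts"
        self.gate = nn.Linear(d_model, d_model)  # text-conditioned dynamic scaling

    @staticmethod
    def quaternion_inner(a, b):
        # Treat the last dim as 4 quaternion components (r, i, j, k); the real
        # part of a * conj(b) is the component-wise quaternion inner product.
        ar, ai, aj, ak = a.chunk(4, dim=-1)
        br, bi, bj, bk = b.chunk(4, dim=-1)
        return ar * br + ai * bi + aj * bj + ak * bk

    def forward(self, text_layers, visual_layers):
        # text_layers / visual_layers: lists of (batch, seq, d_model) tensors
        # taken from different Transformer layers.
        fused = []
        for t in text_layers:
            t_sum = t.mean(dim=1, keepdim=True)         # (B, 1, d) text summary
            scale = torch.sigmoid(self.gate(t_sum))     # text-driven visual scaling
            for v in visual_layers:
                sim = self.quaternion_inner(t_sum, v.mean(dim=1, keepdim=True))
                fused.append(sim.mean(dim=-1, keepdim=True) * scale * v)
        return torch.stack(fused).mean(dim=0)           # (B, seq_v, d_model)

Here every text layer interacts with every visual layer rather than only the layer at the same depth, which is the cross-layer interaction the abstract describes.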
2) Aiming at the problem of sparse cross-modal visual semantic representations, this study proposes Dense Contrastive Visual-Linguistic Pretraining (DCVLP). Existing cross-modal pre-training studies add mask-based classification and regression tasks on visual region features to further improve the pre-trained model, but these are supervised proxy tasks that introduce sparse semantic understanding, leaving cross-modal models poorly robust on fine-grained, complex problems. DCVLP solves the problem of sparse cross-modal visual semantic representations, achieves adaptive learning of fine-grained cross-modal semantic co-occurrence, and keeps the model robust when the visual content undergoes fine-grained semantic changes. Two pre-training methods are designed for DCVLP: cross-modal contrastive pre-training based on a mask perturbation task, and cross-modal contrastive pre-training based on an adversarial perturbation task. The proposed DCVLP method is validated on classic single-stream and dual-stream models; the results show that it significantly improves the original models, demonstrating its wide applicability. On multiple visual-linguistic downstream tasks the method also exceeds contemporaneous models, proving the effectiveness of Dense Contrastive Visual-Linguistic Pretraining.
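As a hedged illustration of the kind of contrastive objective such dense pre-training can rest on, the sketch below treats a perturbed view of each visual region as its positive sample under an InfoNCE-style loss, with a one-step gradient attack standing in for the adversarial perturbation task; the function names, temperature, and step size are illustrative assumptions, not values from the thesis.

import torch
import torch.nn.functional as F

def dense_contrastive_loss(regions, perturbed, temperature=0.07):
    # regions, perturbed: (batch * num_regions, dim) projections of the same
    # visual regions before and after a mask or adversarial perturbation.
    q = F.normalize(regions, dim=-1)
    k = F.normalize(perturbed, dim=-1)
    logits = q @ k.t() / temperature                   # all pairwise similarities
    labels = torch.arange(q.size(0), device=q.device)  # matched view is the positive
    return F.cross_entropy(logits, labels)

def adversarial_view(regions, epsilon=1e-2):
    # One gradient-ascent step on the contrastive loss itself yields the
    # adversarially perturbed positive view.
    delta = torch.zeros_like(regions, requires_grad=True)
    dense_contrastive_loss(regions.detach() + delta, regions.detach()).backward()
    return (regions + epsilon * delta.grad.sign()).detach()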
3) Aiming at the poor robustness of cross-modal models under mixed-granularity attacks, this study proposes a cross-modal fine-tuning method for hybrid-granularity attack and defense. The method consists of two stages. In the attack stage, a hybrid attack simultaneously generates perturbations at the token granularity and at the embedding granularity, yielding high-quality adversarial samples in the approximate semantic space; the adversarial samples cover both approximate-semantic attacks and tiny-perturbation attacks. In the defense stage, to make the model resistant to the hybrid attack, this study uses a distillation loss to perform knowledge distillation between the output distribution under the hybrid attack and the output distribution of the original model, providing dynamic supervision for the cross-modal fine-tuning stage and improving the robustness of the downstream task model. The proposed fine-tuning method is validated on multiple visual-linguistic downstream task datasets; the results show that it significantly improves downstream task performance, validating the effectiveness of the cross-modal fine-tuning method for hybrid-granularity attack and defense.
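The defense stage as described maps onto a standard knowledge-distillation term; below is a minimal sketch under that reading, where model, the batch format, and the weight alpha are placeholder assumptions rather than details from the thesis.

import torch.nn.functional as F

def hybrid_defense_loss(model, clean_batch, adv_batch, labels, alpha=1.0):
    clean_logits = model(**clean_batch)  # output distribution of the original input
    adv_logits = model(**adv_batch)      # output under token- plus embedding-level attack
    task_loss = F.cross_entropy(clean_logits, labels)
    # Distill the clean output distribution into the attacked one so predictions
    # stay stable under mixed-granularity perturbations (dynamic supervision).
    kl = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                  F.softmax(clean_logits, dim=-1).detach(),
                  reduction='batchmean')
    return task_loss + alpha * kl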