Current research on events mainly focuses on the event extraction task. The purpose of event extraction is to extract the trigger words and arguments of events from text, which is a fine-grained classification task. Although some researchers have improved event extraction by additionally constructing external image datasets, these images do not come from the original source of the text. In this dissertation, we study multimodal event classification on naturally corresponding text-image pairs from social media. Multimodal event classification infers the event category of a sample from the multimodal information of its text and image, which is a coarse-grained classification task. The key is to extract features from text and images and perform effective multimodal fusion. This dissertation focuses on how to achieve effective multimodal fusion, utilizing state-of-the-art pre-trained models from three aspects to further improve the effect of multimodal fusion. The main contents of the dissertation are as follows:

(1) Multimodal Event Classification of Social Media Based on the Attention Mechanism. This dissertation studies multimodal event classification on text-image pairs about events in a specific domain published by users on social media; the text and images in such multimodal posts usually share event-specific information. To this end, we propose a multimodal event classification method based on the attention mechanism, which automatically focuses on important information in the text and images and facilitates information interaction between modalities. We conduct experiments on CrisisMMD, a multimodal dataset about disaster events. The experimental results show that the proposed method significantly outperforms unimodal models such as BERT and VGG, and achieves competitive performance compared with several strong baseline
systems.

(2) Caption-Semantic Alignment for Multimodal Event Classification of Social Media. To alleviate the short-text problem in the multimodal dataset CrisisMMD, this dissertation proposes a new multimodal event classification method that uses caption semantics to assist alignment, making full use of the visual semantic features of images. We first use a vision-language pre-trained model to generate captions for the images in the dataset, and then use the captions as auxiliary features to facilitate learning the multimodal alignment between text and images. Experimental results show that captions help to learn the alignment between the text and image modalities, and that the proposed method significantly outperforms three state-of-the-art vision-language pre-trained models: LXMERT, VisualBERT, and ViLT.

(3) Utilizing a Vision-Language Pre-Trained Model for Multimodal Event Classification of Social Media. To fully exploit the learning ability of pre-trained models, this dissertation proposes a multimodal event classification method based on a vision-language pre-trained model. The method uses the unified pre-trained model CLIP to extract text and image features, and builds a Transformer encoder to achieve cross-modal interaction of higher-level fine-grained features between text and image. It effectively utilizes the unimodal features from the vision-language pre-trained model, extracts high-level features within each modality, and promotes feature interaction between modalities. Experimental results show that the proposed method outperforms several strong baseline systems on the multimodal event classification task.

In summary, this dissertation studies the multimodal event classification task, proposing solutions to the related problems in order to promote multimodal fusion. Experimental results show that the models designed in this dissertation improve the
performance of the multimodal event classification task, making several successful attempts at applying multimodal deep learning methods to this task.
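The cross-modal fusion idea underlying contribution (3) can be illustrated with a minimal sketch: concatenate the text token features and image patch features into one sequence and apply a single self-attention layer, so that every token can attend across modalities. This is an illustrative NumPy toy, not the dissertation's actual implementation: the random tensors stand in for CLIP-extracted features, and the weight matrices and function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_encoder(text_feats, image_feats, Wq, Wk, Wv):
    """One self-attention layer over the concatenated text and image
    token sequences, in the spirit of a Transformer encoder layer."""
    tokens = np.concatenate([text_feats, image_feats], axis=0)  # (T+I, d)
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # each token attends to both modalities
    fused = attn @ V                      # (T+I, d) fused token features
    return fused.mean(axis=0)             # pooled multimodal representation

rng = np.random.default_rng(0)
d = 8
text_feats = rng.standard_normal((5, d))   # stand-in for CLIP text token features
image_feats = rng.standard_normal((7, d))  # stand-in for CLIP image patch features
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
rep = cross_modal_encoder(text_feats, image_feats, Wq, Wk, Wv)
print(rep.shape)  # (8,)
```

In a full model, the pooled representation `rep` would feed a classification head over event categories; a real Transformer encoder layer would also add multi-head attention, residual connections, layer normalization, and a feed-forward sublayer.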