With the rapid development of Internet technology, social media platforms such as Twitter and Weibo have become an integral part of many people's daily lives. Every day, people publish large numbers of posts to express their opinions, making social media a platform rich in user-generated content and information. Information extraction is a key step in uncovering events, opinions, and group and individual preferences from social media posts. However, because social media text is usually short, insufficient semantics can cause ambiguity, so traditional text-based information extraction methods perform poorly in this domain. At the same time, users actively attach images to their posts to better express their thoughts, which has given rise to multimodal information extraction methods that consider both text and image information. By introducing image information as supplementary semantics, these methods can resolve ambiguities in the text and effectively improve on text-based information extraction, and they have become an important direction in the field.

However, current multimodal information extraction methods mainly study how modalities interact and how image information is represented, and three problems remain: (1) in modal matching, current methods assume that image information always helps the text, ignoring the problem of image-text mismatch; (2) in modal alignment, current methods use two different encoders to obtain the image and text representations respectively, ignoring the problem of inconsistent image-text representations; (3) in modal representation, current methods use only one type of image encoder to obtain the image representation, ignoring the problem that image information is not fully exploited. To address these problems, this paper investigates a general multimodal
information extraction method. The method can be applied to various multimodal information extraction tasks; specifically, this paper takes two typical tasks, multimodal named entity recognition and multimodal relation extraction, as examples for the research. The main contributions are as follows:

(1) To address the inconsistency of image and text representations and the existence of image-text mismatch, this paper proposes a general multimodal information extraction framework based on contrastive learning. The framework aligns the text and image representations through contrastive learning, solving the inconsistency problem. In addition, it trains an image-text matching classifier in a self-supervised manner to predict the probability that an image and a text match, and uses this probability to determine the proportion of image information to retain, solving the image-text mismatch problem. Experiments and case studies demonstrate the effectiveness of this framework.

(2) To further address image-text mismatch, this paper also proposes a reinforcement-learning-based framework for selecting multimodal information extraction models. The framework trains a data discriminator through reinforcement learning to automatically decide whether each sample is better handled by a unimodal or a multimodal model, and solves the mismatch problem by routing the data to the corresponding model. Experimental results demonstrate the effectiveness of the framework and its applicability to various unimodal and multimodal information extraction models.

(3) To address the underutilization of image information, this paper proposes a general multimodal information extraction framework based on visual prompt learning. The framework can fuse different image
representations together and, based on prompt learning, let the fused result interact with the text, solving the problem that image information is not fully exploited. Experimental results demonstrate the effectiveness of the framework, and ablation studies and sample analyses show that using multiple different types of image representations effectively improves model performance.
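As a minimal illustrative sketch of the gating idea in contribution (1) — not the thesis's actual implementation, with all names and dimensions hypothetical — the predicted image-text match probability can be used to scale how much of the image representation is retained before fusing it with the text representation:

```python
import numpy as np

def gate_image_features(text_repr, image_repr, match_prob):
    """Retain image information in proportion to the predicted
    image-text match probability (hypothetical sketch).

    match_prob near 1 keeps the image representation;
    match_prob near 0 suppresses a mismatched image.
    """
    gated = match_prob * image_repr              # scale image features
    return np.concatenate([text_repr, gated])    # simple fusion with text

# Toy usage: a poorly matched image (prob 0.1) contributes little.
text = np.ones(4)
image = np.full(4, 10.0)
fused = gate_image_features(text, image, match_prob=0.1)
```

In the full framework the match probability would come from the self-supervised matching classifier rather than being supplied by hand, and the fusion would feed a downstream extraction model instead of a simple concatenation.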