
Research On Information Extraction Based On Multimodal Features

Posted on: 2023-04-21 | Degree: Master | Type: Thesis
Country: China | Candidate: S Z Wei | Full Text: PDF
GTID: 2568307061453874 | Subject: Software engineering
Abstract/Summary:
With the rapid development of the mobile Internet, social platforms such as Twitter and Weibo have won over a large user base thanks to their convenience and ease of sharing. People can easily express opinions and share their daily lives on these platforms, and their posts often contain not only text but also pictures uploaded to reinforce what the text expresses. Information extraction aims to pull structured information such as entities, relations and events out of natural language text, providing data support for downstream applications such as knowledge graphs, question answering and recommender systems. Tweets are typically short, noisy and informally written, yet the attached picture is often highly correlated with the text and can compensate for what the text fails to express. Traditional text-only information extraction methods are therefore no longer adequate, and multimodal information extraction that leverages pictures has become a research hotspot in recent years. This thesis studies three information extraction tasks over text and images: named entity recognition, social relation extraction and entity linking. The specific research contents are as follows:

Firstly, the thesis proposes a multimodal named entity recognition approach based on targeted visual guidance. The approach builds a unified multimodal graph over the input text and picture, in which each node represents a semantic unit: a textual word or a visual object detected by an object-detection toolkit. Two kinds of edges capture the relationships between semantic units within the same modality and across modalities, respectively. On top of this graph, multiple multimodal feature fusion layers are stacked so that nodes interact iteratively: nodes of the same modality exchange information through a Transformer that directly captures their dependencies, while nodes of different modalities use a cross-modal gating mechanism to collect semantic information from their cross-modal neighbor nodes. Finally, the enhanced text representation is decoded by a CRF layer to extract named entities. Experimental results show that the approach outperforms the benchmark models on multimodal named entity recognition, and ablation studies further verify its effectiveness.
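The following is a minimal PyTorch sketch of one such fusion layer, meant only to illustrate the idea of combining intra-modal Transformer interaction with a cross-modal gate; it is not the thesis implementation. The class name CrossModalGatedFusion, the hidden size, the single-direction (text-side) update and the way the gate is parameterised are all illustrative assumptions.

import torch
import torch.nn as nn

class CrossModalGatedFusion(nn.Module):
    """One fusion layer: intra-modal self-attention plus a gated cross-modal update.
    Illustrative sketch only; shows the text-side update for brevity."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Intra-modal interaction: standard Transformer encoder layers.
        self.text_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.visual_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Cross-modal interaction: word nodes attend to visual object nodes ...
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # ... and a learned gate decides how much of that signal each word absorbs.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_nodes: torch.Tensor, visual_nodes: torch.Tensor):
        # text_nodes: (batch, n_words, dim); visual_nodes: (batch, n_objects, dim)
        t = self.text_self(text_nodes)        # word-word dependencies
        v = self.visual_self(visual_nodes)    # object-object dependencies
        cross, _ = self.cross_attn(t, v, v)   # each word attends to its visual neighbors
        g = torch.sigmoid(self.gate(torch.cat([t, cross], dim=-1)))
        return t + g * cross, v               # gated fusion of visual evidence into text

if __name__ == "__main__":
    layer = CrossModalGatedFusion(dim=256)
    words = torch.randn(2, 12, 256)    # 12 token nodes per sentence
    objects = torch.randn(2, 5, 256)   # 5 detected visual objects per image
    fused, _ = layer(words, objects)
    print(fused.shape)                 # torch.Size([2, 12, 256])

In the thesis, several such fusion layers are stacked and the enhanced text representation is then passed to a CRF decoder; stacking in this sketch would simply mean applying the layer repeatedly to the pair of node sets.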
Secondly, the thesis proposes a multimodal graph fusion method for social relation extraction based on syntactic and facial features. At the text level, the method integrates three kinds of syntactic information: part of speech, dependency edges and dependency labels; at the image level, it uses a Transformer to model the implicit associations between the faces of the head and tail entities. To construct the multimodal graph, the word vectors of the head and tail entities are max-pooled into two textual nodes and their facial representations form two visual nodes; each textual node is connected to both visual nodes, and vice versa. A cross-modal attention mechanism then fuses the multimodal features. Because the datasets are unevenly distributed and samples for many social relation categories are sparse, the thesis performs few-shot learning with a classical prototypical network. Experimental results show that the method effectively incorporates syntactic and facial features and produces higher-quality text embeddings through multimodal fusion; under various few-shot settings, its classification accuracy is significantly ahead of the other baselines.

Finally, the thesis proposes an image-text pre-training and prompt-based fine-tuning method for multimodal entity linking. Because manual annotation is costly, the thesis first runs a script to automatically construct a multimodal entity linking dataset according to the characteristics of tweets and conducts its experiments on this dataset. In the multimodal pre-training stage, two tasks are designed on top of BERT: masked word prediction and image-text alignment. In the fine-tuning stage, since annotated samples are few, the thesis constructs a prompt template consistent with the pre-training tasks, so that the knowledge learned by the pre-trained model can be exploited as directly as possible. Experimental results show that multimodal pre-training greatly improves downstream performance and that prompt-based fine-tuning achieves better results under low-resource learning. Ablation studies further demonstrate the rationality of the designed pre-training tasks and the generalizability of the pre-trained model.
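To illustrate how prompt-based fine-tuning can reuse a masked-language-modelling head, the sketch below scores a mention/candidate-entity pair as a cloze question with Hugging Face transformers. It is a text-only toy under explicit assumptions, not the thesis method: the thesis pre-trains its own image-text model and uses a prompt aligned with its own pre-training tasks, whereas here the off-the-shelf bert-base-uncased checkpoint, the template wording, the "yes"/"no" verbalizer and the helper name link_score are all illustrative.

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def link_score(tweet: str, mention: str, candidate: str) -> float:
    # Cloze-style prompt: the model fills the [MASK] slot with "yes" or "no".
    prompt = (f'{tweet} Question: does "{mention}" refer to {candidate}? '
              f'Answer: {tokenizer.mask_token}')
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits                      # (1, seq_len, vocab)
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    mask_logits = logits[0, mask_pos]                        # scores over the vocabulary
    yes_id = tokenizer.convert_tokens_to_ids("yes")
    no_id = tokenizer.convert_tokens_to_ids("no")
    probs = torch.softmax(mask_logits[[yes_id, no_id]], dim=0)
    return probs[0].item()                                   # probability mass on "yes"

# Rank candidate entities for a mention by their "yes" probability.
candidates = ["the Golden State Warriors basketball team", "the 1979 film The Warriors"]
scores = {c: link_score("Watching the Warriors game tonight!", "Warriors", c)
          for c in candidates}
print(max(scores, key=scores.get))

The point of the sketch is the shape of the objective: because the fine-tuning prompt is answered through the same masked-prediction interface used during pre-training, the knowledge stored in the pre-trained model can be queried directly even when annotated samples are scarce, which is the motivation the thesis gives for its prompt template.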
Keywords/Search Tags:Multimodal, Information Extraction, Graph Neural Network, Feature Fusion, Multimodal Pre-training