Language representation learning is a vital research field that focuses on extracting semantic information from textual data, enabling computers to understand natural language. It is a critical component of artificial intelligence and can significantly improve the performance of various downstream tasks. While deep learning has made substantial progress in language representation learning for resource-rich languages such as English and Chinese, there is still considerable room for improvement. Furthermore, the scarcity of annotated data for many of the world's languages poses a significant challenge for representation learning on low-resource languages. Additionally, the emergence of language-visual multimodal data has created a higher demand for language representation learning. To address these challenges, this paper proposes solutions from several perspectives: representation learning for resource-rich languages, representation learning for low-resource languages, and cross-modal language representation learning. The main research contents of this paper are outlined as follows:

In the area of representation learning for resource-rich languages, this study tackles the difficulty that existing pre-training models have in comprehending sentences with complex structures. To overcome this issue, we propose a syntax-enhanced pre-training model that incorporates external syntactic knowledge. By designing a syntax-aware model structure, building a large-scale syntactic pre-training corpus, and introducing a new syntax-aware pre-training task, our model can leverage syntactic knowledge to enhance its capacity for parsing complex sentences. Experimental results on six public datasets show that our model learns language representations of higher quality.

Regarding representation learning for low-resource languages, this paper addresses the problem that existing machine reading comprehension methods struggle to accurately learn the correlations between words in different languages and the syntactic relationships among words within a single language. To overcome this issue, we propose a model that incorporates cross-lingual semantic alignment knowledge and syntactic knowledge. By constructing an external knowledge-aware graph, designing graph representation learning algorithms, and introducing a new graph-based pre-training task, our model can better exploit the integrated external knowledge to learn semantic correlations between words. Experimental results on two public datasets show that the proposed method significantly improves the quality of low-resource language representations.

In the field of language-image cross-modal representation learning, this paper addresses the challenge of accurately extracting relevant semantic information from multimodal data to enrich language representations for multimodal sentiment classification tasks with limited training data. To overcome this issue, we propose a model that incorporates cross-modal semantic alignment knowledge and syntactic knowledge and constructs a knowledge-induced matrix. A graph convolutional network-based algorithm enables the knowledge-induced matrix to reflect multi-hop relationships, and a discretization operation filters out irrelevant connections. The resulting matrix is then used to cut off irrelevant connections within the textual modality or across modalities, enabling the model to accurately extract relevant semantic information from multimodal data even with limited training data. Experimental results on two public datasets show that the proposed method learns language representations that carry richer semantic information.
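To give a concrete flavor of the syntax-enhanced pre-training idea in the first contribution, the following is a minimal, hypothetical sketch of one common way to make self-attention syntax-aware: biasing attention scores by dependency-tree distance. The abstract does not specify the actual mechanism, so the function `syntax_biased_attention`, the `dep_distance` input, and the `alpha` hyperparameter are illustrative assumptions, not the thesis's implementation.

```python
import torch
import torch.nn.functional as F

def syntax_biased_attention(q, k, v, dep_distance, alpha=1.0):
    """Scaled dot-product attention with a syntactic-distance bias.

    q, k, v:      (batch, heads, seq_len, d_k) projected inputs
    dep_distance: (batch, seq_len, seq_len) hop counts between tokens in
                  the dependency tree (a hypothetical preprocessing output)
    alpha:        strength of the syntactic penalty (illustrative)
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # (b, h, n, n)
    # Down-weight attention between syntactically distant tokens;
    # unsqueeze(1) broadcasts the bias over attention heads.
    scores = scores - alpha * dep_distance.unsqueeze(1)
    weights = F.softmax(scores, dim=-1)
    return weights @ v
```

Comparable effects could also be obtained with relative-position-style embeddings indexed by tree distance; the additive penalty is simply the shortest version to write down.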
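Both the external knowledge-aware graph of the second contribution and the knowledge-induced matrix of the third rest on the same pattern: propagate external knowledge edges over multiple hops, then prune weak connections. The sketch below illustrates that pattern under stated assumptions; the function name `knowledge_induced_mask`, the row-normalized propagation, and the threshold `tau` are illustrative choices, as the abstract does not detail the actual algorithm.

```python
import torch

def knowledge_induced_mask(adj, hops=2, tau=0.05):
    """Expand a binary knowledge adjacency matrix to multi-hop relations,
    then discretize it into a hard mask that cuts irrelevant connections.

    adj:  (n, n) matrix; adj[i, j] = 1 if an external knowledge edge
          (syntactic or cross-modal alignment) links nodes i and j
    hops: number of propagation steps (illustrative)
    tau:  threshold below which a connection is treated as irrelevant
    """
    n = adj.size(0)
    a_hat = adj + torch.eye(n)                         # add self-loops
    a_norm = a_hat / a_hat.sum(dim=-1, keepdim=True)   # row-normalize
    scores = torch.eye(n)
    for _ in range(hops):                              # accumulate multi-hop paths
        scores = scores @ a_norm
    return (scores > tau).float()                      # discretize: 1 = keep, 0 = cut

# Toy usage: in a 4-node chain 0-1-2-3, two hops connect node 0 to node 2.
chain = torch.tensor([[0., 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
print(knowledge_induced_mask(chain))
```

In a model, such a 0/1 mask would typically be applied to attention or adjacency weights so that pruned entries contribute nothing, which matches the abstract's description of cutting off irrelevant textual or cross-modal connections.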
In terms of language-video cross-modal representation learning, this paper addresses the challenge of effectively utilizing video information to enrich language representations in video question answering. Specifically, existing methods often disregard the different semantic compositions of words, which makes it difficult to accurately exploit video semantic information. To tackle this issue, we introduce external syntactic knowledge to help the model better understand these semantic compositions. First, we construct a syntactic hypergraph from the syntactic knowledge and employ a hypergraph convolutional network to model the semantic compositions among words. Then, we introduce an optimal-transport-based mechanism that accurately extracts the relevant semantic information from the video according to the different semantic compositions and propagates it to the corresponding words through the syntactic hypergraph. Experimental results on three public datasets show that the proposed method effectively utilizes video information to enrich language representations.
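As a rough illustration of the fourth contribution's two building blocks, the sketch below pairs a single HGNN-style hypergraph convolution with an entropic optimal-transport (Sinkhorn) alignment between word and video-frame features. Everything here is a minimal sketch under assumptions: the incidence-matrix encoding of syntactic hyperedges, the uniform marginals, and the hyperparameters `eps` and `iters` are not taken from the thesis.

```python
import torch
import torch.nn.functional as F

def hypergraph_conv(x, h, weight):
    """One HGNN-style hypergraph convolution step.

    x:      (n_words, d_in) word features
    h:      (n_words, n_edges) incidence matrix; h[i, e] = 1 if word i
            belongs to syntactic hyperedge e (e.g. a dependency subtree)
    weight: (d_in, d_out) learnable projection
    """
    dv = h.sum(dim=1, keepdim=True).clamp(min=1)   # node degrees
    de = h.sum(dim=0).clamp(min=1)                 # hyperedge degrees
    edge_feat = (h / de).t() @ x                   # words -> hyperedges
    node_feat = (h / dv) @ edge_feat               # hyperedges -> words
    return F.relu(node_feat @ weight)

def sinkhorn_plan(text_feat, video_feat, eps=0.1, iters=50):
    """Entropic optimal-transport plan between words and video frames."""
    cost = 1 - F.normalize(text_feat, dim=-1) @ F.normalize(video_feat, dim=-1).t()
    kern = torch.exp(-cost / eps)                  # Gibbs kernel
    n_w, n_f = kern.shape
    a = torch.full((n_w,), 1.0 / n_w)              # uniform word marginal
    b = torch.full((n_f,), 1.0 / n_f)              # uniform frame marginal
    u, v = torch.ones(n_w), torch.ones(n_f)
    for _ in range(iters):                         # Sinkhorn iterations
        u = a / (kern @ v)
        v = b / (kern.t() @ u)
    return u.unsqueeze(1) * kern * v.unsqueeze(0)  # (n_words, n_frames)
```

Under these assumptions, a word-level video context could be read off as `sinkhorn_plan(text_feat, video_feat) @ video_feat` and then spread to syntactically related words by further `hypergraph_conv` steps, mirroring the extract-then-propagate flow the abstract describes.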