| Similarity detection for scientific documents can provide important technical support for the formal review of scientific project applications: a detection model is built to calculate the degree of similarity between documents. Existing text matching models rely heavily on labeled data, while text representation models are simple and flexible but capture only limited features. BERT captures features powerfully but does not take into account the associations between documents. To make full use of the rich entity and relationship information shared among scientific documents, this thesis proposes a self-supervised, data-augmented contrastive learning fine-tuning framework that uses BERT as the base model. On this basis, a knowledge graph is introduced, and a representation model for metapath-based text interaction graphs using mutual information is proposed. The main research contents and innovations of this thesis are as follows: (1) Building on a research project on key technologies for power knowledge graphs, this thesis constructs a dataset of power scientific documents and a power knowledge graph. Furthermore, a similarity detection model for scientific documents is designed to provide technical support for the scientific project review service of the knowledge graph service platform. (2) To address the poor performance of the BERT model on similarity tasks, an intra-text and inter-text contrastive learning pre-training strategy is introduced, and a self-supervised, data-augmented contrastive learning fine-tuning framework is proposed. The method uses BERT as the base model, introduces a whole-word masking mechanism in the pre-training stage, and employs intra-text and inter-text sentence-pair sampling strategies to minimize the contrastive loss. At the same time, text features are learned at both word and sentence granularity to alleviate the anisotropy problem. In the fine-tuning stage, two types of data augmentation, feature reconstruction and text reconstruction, are combined with contrastive learning to capture text-level features and thus distinguish dissimilar texts. (3) To make full use of the semantic and knowledge features of scientific documents, a representation model for metapath-based text interaction graphs using mutual information is proposed. The model defines different semantic metapaths based on the relationships between entities in order to build text interaction graphs of scientific documents. The text interaction graphs and the text semantic representation vectors serve as initial features and are fed into a graph convolutional neural network for node representation learning. The final representation is obtained by aggregating the node embeddings learned under the different metapaths, weighted by attention scores. During training, to prevent the model from focusing too much on the local neighborhood features of nodes, "Local-Global" mutual information is introduced to enhance the global similarity of nodes. The experimental results show the effectiveness of the proposed method. There are 36 figures, 16 tables, and 71 references in this thesis. |
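
The abstract does not state the exact form of the contrastive loss used in the pre-training and fine-tuning stages. As an illustration only, the sketch below assumes a common choice, an InfoNCE-style loss over cosine similarities with in-batch negatives. The function names `cosine` and `info_nce` and the `temperature` value are hypothetical and not taken from the thesis; in practice the vectors would be BERT sentence embeddings of a text and its augmented view.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch (assumed loss form, not the thesis's).

    positives[i] is the positive view of anchors[i]; every other entry of
    `positives` acts as an in-batch negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: each anchor matches its own positive exactly.
batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(batch, batch)
# Reversing the positives pairs every anchor with the wrong text.
mismatched_loss = info_nce(batch, batch[::-1])
```

With correctly paired views the loss is close to zero, and it grows when the pairs are mismatched, which is the behavior a contrastive objective needs in order to pull similar texts together and push dissimilar texts apart.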