Using computational methods to predict disease-associated genes can help reveal the molecular mechanisms of disease, guide the experimental verification of gene-disease relationships, and reduce experimental costs. Complex life activities are often mediated by molecular cooperation and interactions, so many approaches uncover disease-associated genes by constructing and analyzing molecular networks. With advances in research on disease-related genetic factors and in sequencing technologies, it has become possible to fuse multiple networks into a heterogeneous network that contains associations and interactions between diseases and multiple kinds of biomolecules. Heterogeneous networks make it possible to analyze and understand diseases at multiple levels and to identify potential gene-disease associations. In a heterogeneous network, nodes and edges belong to different types and have different characteristics and properties. How to better exploit the information embedded in the different types of nodes and edges is a major challenge for prediction algorithms based on heterogeneous networks. Most existing methods rely on a meta-path strategy, but meta-paths depend on prior knowledge of a specific domain and restrict free access from a node to its neighboring nodes. Graph neural network models for heterogeneous graphs offer new ways to mine heterogeneous biological networks. In this thesis, we introduce HGT, an edge type-aware graph neural network model, to extract node features for predicting associations between coding genes/lncRNAs and diseases on a heterogeneous disease-biomolecule association network. The result is GeDi-HGT, an extensible heterogeneous network-based framework for gene-disease association prediction, which takes the heterogeneous network and the semantic vectors of its nodes as
input, models this input with HGT, fuses the information contained in the network topology and in the text semantics of the nodes to create vector representations of genes and diseases, and uses these vectors to calculate association scores for candidate gene-disease pairs in order to predict disease-associated coding genes and lncRNAs. In this work, we construct a heterogeneous network containing four object types (coding gene, lncRNA, microRNA, and disease) and six relationship types among them. For each node in the network, we collect its description text and use BioBERT to generate a semantic vector. When aggregating and propagating information from a node's neighborhood, the method uses an attention mechanism to assign weights to the different neighboring nodes. The computation is edge type-aware: it accounts for the types of the edges between a node and its neighbors by creating a separate parameter matrix for each edge type. Because this computation differentiates at the level of individual edges, it takes the heterogeneity of the network fully into account and is more flexible than the meta-path strategy. In the comparison experiments, GeDi-HGT reached an AUC of 0.8073 and an NDCG of 0.8106, outperforming gene-disease association prediction algorithms based on random walks, heterogeneous graph embedding, and graph neural networks on homogeneous networks. Ablation experiments further demonstrate that using text semantics as input to the graph neural network improves prediction accuracy. By analyzing the model's information propagation paths on the heterogeneous network, we found that the model can learn biologically meaningful propagation paths. In addition, we extracted the 10 coding genes and the 10 lncRNAs with the highest association scores for breast cancer, most of which are reported in databases and the literature, indicating that the method
proposed in this thesis can effectively predict disease-related coding genes and lncRNAs, providing strong guidance for subsequent biological experimental validation.
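
The edge type-aware attention described above can be illustrated with a minimal sketch. It is a simplified, single-head illustration under stated assumptions, not the GeDi-HGT implementation: the function names, the edge-type labels, the toy 2-dimensional vectors, and the sigmoid dot-product scoring function are all hypothetical, and the full HGT model contains further components that are omitted here. The sketch only shows the core idea that each edge type has its own parameter matrices and that attention weights are normalized over all neighbors of a node.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(target_vec, neighbors, att_weights, msg_weights):
    """Edge type-aware attention aggregation for one target node.

    neighbors: list of (edge_type, neighbor_vector) pairs.
    att_weights, msg_weights: dicts mapping each edge type to its own
    square parameter matrix (hypothetical names for this sketch).
    Returns the attention weights and the aggregated vector.
    """
    scale = math.sqrt(len(target_vec))
    scores, messages = [], []
    for edge_type, vec in neighbors:
        # The matrix is selected by edge type, so the same neighbor vector
        # is scored differently depending on the relation it arrives through.
        key = matvec(att_weights[edge_type], vec)
        scores.append(dot(target_vec, key) / scale)
        messages.append(matvec(msg_weights[edge_type], vec))
    alphas = softmax(scores)
    dim = len(messages[0])
    aggregated = [sum(a * m[i] for a, m in zip(alphas, messages))
                  for i in range(dim)]
    return alphas, aggregated

def association_score(gene_vec, disease_vec):
    """Sigmoid of the dot product (an assumed scoring function)."""
    return 1.0 / (1.0 + math.exp(-dot(gene_vec, disease_vec)))

# Toy example: one disease node with two neighbors reached through
# two different (hypothetically named) edge types.
identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
att = {"gene-disease": identity, "lncRNA-disease": doubled}
msg = {"gene-disease": identity, "lncRNA-disease": identity}
disease = [1.0, 0.0]
neighbors = [("gene-disease", [1.0, 0.0]), ("lncRNA-disease", [1.0, 0.0])]
alphas, updated = aggregate(disease, neighbors, att, msg)
```

In this toy setting both neighbors carry the same vector, yet they receive different attention weights because their edges are of different types. This is the edge-level differentiation that a fixed meta-path would not provide.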