Graphs (networks) are ubiquitous in the real world and can flexibly model interaction processes in complex systems, e.g., social networks, communication networks, molecule graphs, and protein graphs. Graph machine learning refers to machine learning methods on graph data. Among them, graph neural networks (GNNs) have received wide attention in the community because of their broad applicability and excellent performance. Current GNN methods are built on the IID assumption, i.e., that the training and testing graphs/nodes are drawn from the same distribution. However, due to the uncontrollable collection process of real data, data selection bias is inevitably introduced, resulting in distribution shifts between the training and testing sets, so the IID assumption is difficult to satisfy in real-world applications. Guaranteeing the generalization of GNNs under data selection bias is therefore of great significance for applying GNNs in practice. A fundamental reason that hinders the generalization ability of GNNs on biased data is that what GNNs learn is the correlation between the input graph data and the label. This correlation is not stable when the data is biased, so the predictive performance of the model degenerates when the testing correlation changes. Causal inference aims to discover the causal relations between variables or to measure the causal effect of the input variables on the labels. Such causal relations/effects are usually deemed stable or invariant. For example, in a molecule graph, functional groups often play a decisive role in molecular properties, rather than highly correlated structures such as carbon rings. Therefore, if we utilize causal inference to constrain GNNs to learn the invariant causal relationships in graph data instead of correlations, their generalization ability can be greatly improved.

As causal inference methods are usually designed for low-dimensional data while graphs are complicated non-Euclidean data, regularizing GNN methods with causal inference faces the following challenges: (1) how to effectively incorporate causal inference into GNN methods; (2) how to effectively learn causal relationships in graph data; (3) how to build a GNN model with appealing inherent causal interpretability. In response to these challenges, this research studies GNN methods under causal regularization. First, we conduct research on a causally constrained feature-independence learning framework. Then, supported by this framework, we study graph causal representation learning methods for the node classification and graph classification problems. Finally, a disentangled causal substructure learning method is further investigated to provide appealing inherent interpretability of predictions. In summary, the main contributions and innovations of this thesis are as follows:

First, to marry causal inference with GNNs, we design a causally constrained feature-independence learning framework. Taking clustering as an example, the problem of data selection bias in the clustering task is studied, and a decorrelation-regularized clustering algorithm is proposed. The goal of the decorrelation regularizer is to learn a set of sample weights that remove spurious correlations between features. Meanwhile, the decorrelation regularizer and weighted k-means are jointly optimized, so that the sample weights remove exactly the correlations that are harmful to clustering. The effectiveness of this feature-independence learning framework is verified on real biased data.

Second, to relieve the effect of data selection bias on the node classification task, a debiased GNN framework is proposed to learn the invariant relationship between node representations and labels. We first conduct both an experimental study and a theoretical proof to show that data selection bias degenerates the generalization performance of GNNs. To remove parameter estimation bias, this study proposes a differential decorrelation regularizer to estimate a sample weight for each labeled node, so that spurious correlations between the learned node embedding dimensions are removed. The learned sample weights are then used to reweight the GNN model and remove the estimation bias. The effectiveness of the proposed method is validated on graph data with two kinds of selection bias.

Third, to generalize GNNs to out-of-distribution data in the graph classification task, a general graph causal representation learning framework is proposed to learn the invariant relationship between high-level semantic variables and labels. To eliminate the influence of subgraph-level spurious correlations on the stable prediction of GNNs, high-level subgraph representations are first extracted from the original graph, and a causal regularization term is then applied to make the high-level variables independent of each other, so that the causal relationship between the high-level subgraph representations and the labels is learned. The effectiveness and interpretability of the proposed causal representation learning method are validated on a large number of simulated and real molecular out-of-distribution graph datasets.

Fourth, to further improve the inherent causal interpretability of GNNs, we propose a debiased graph neural network method based on disentangled causal substructure learning. The proposed method utilizes bias-aware and causal-aware losses to separate the disentangled causal and bias substructures, which enables accurate prediction while providing a causal basis via the learned edge mask. The proposed method is validated to efficiently extract causal substructures on three newly constructed datasets whose bias degree can be controlled and which are easy to interpret.
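To make the sample-reweighting idea behind the decorrelation regularizer concrete, the following is a minimal sketch (not the thesis's actual algorithm; the function name and optimization details are hypothetical): it learns non-negative sample weights by gradient descent so that the off-diagonal entries of the weighted feature covariance matrix shrink, which is the sense in which spurious correlations between feature dimensions are "removed". For simplicity, the dependence of the weighted mean on the weights is ignored in the gradient.

```python
import numpy as np

def decorrelation_weights(X, n_iter=500, lr=0.5):
    """Learn sample weights on the simplex that reduce spurious
    correlations between feature dimensions, by minimizing the squared
    off-diagonal entries of the weighted covariance matrix (a sketch)."""
    n, d = X.shape
    a = np.zeros(n)                          # unconstrained parameters
    for _ in range(n_iter):
        w = np.exp(a)
        w = w / w.sum()                      # softmax: weights sum to 1
        mu = w @ X                           # weighted mean, shape (d,)
        Xc = X - mu
        C = (w[:, None] * Xc).T @ Xc         # weighted covariance, (d, d)
        O = C - np.diag(np.diag(C))          # off-diagonal part
        # gradient of ||O||_F^2 w.r.t. each w_i (treating mu as fixed):
        # dC/dw_i = outer(Xc_i, Xc_i), so dL/dw_i = 2 <O, outer(Xc_i, Xc_i)>
        g_w = np.array([2.0 * np.sum(O * np.outer(Xc[i], Xc[i]))
                        for i in range(n)])
        g_a = w * (g_w - np.dot(w, g_w))     # chain rule through softmax
        a -= lr * g_a
    w = np.exp(a)
    return w / w.sum()
```

On data where two features are spuriously correlated in a majority of samples, the learned weights upweight the minority samples so the weighted correlation cancels out; in the thesis's framework such weights are then jointly optimized with the downstream objective (e.g., weighted k-means or a reweighted GNN loss) rather than in isolation as shown here.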