Paper information
Paper title: Shift-Robust GNNs: Overcoming the Limitations of Localized Graph Training Data
Authors: Qi Zhu, Natalia Ponomareva, Jiawei Han, Bryan Perozzi
Venue: NeurIPS 2021
1 Introduction
Semi-supervised learning on graphs uses the relationships between data points (i.e., edge connections, which carry an inductive bias) together with a set of labeled samples to predict the labels of the remaining nodes.
The problem addressed here: when the distribution of the training set differs from that of the test set, the model tends to overfit and generalize poorly. When only a small labeled subset of the data is selected for training, this problem is especially pronounced.
Concretely, the contributions are as follows:
1. We provide the first focused discussion on the distributional shift problem in GNNs.
2. We propose a generalized framework, Shift-Robust GNN (SR-GNN), which can address shift in both shallow and deep GNNs.
3. We create an experimental framework which allows for creating biased train/test sets for graph learning datasets.
4. We run extensive experiments and analyze the results, proving that our methods can mitigate distributional shift.
2 Related Work
Standard learning theory assumes that training and inference data come from the same distribution, but in many practical settings this does not hold. In transfer learning, domain adaptation deals with transferring knowledge from a source domain (used for learning) to a target domain (the distribution on which inference is ultimately performed).
As a pioneering work in this area, [3] uses a distance metric between the source and target domains to quantify their similarity. To obtain the final model, an intuitive idea is to train on a weighted combination of source and target data, where the weights are a function of the domain distance.
3 Distributional shift in GNNs
A semi-supervised node classifier is usually trained with the cross-entropy loss $l$:
$\mathcal{L}=\frac{1}{M} \sum\limits_{i=1}^{M} l\left(y_{i}, z_{i}\right)$
When training and test data come from the same distribution, i.e. $\operatorname{Pr}_{\text {train }}(X, Y)=\operatorname{Pr}_{\text {test }}(X, Y)$, the trained classifier performs well.
3.1 Data shift as representation shift
Under the standard learning-theory assumption $\operatorname{Pr}_{\text {train }}(Y \mid Z)=\operatorname{Pr}_{\text {test }}(Y \mid Z)$, the main cause of distributional shift is representation shift, i.e.
$\operatorname{Pr}_{\text {train }}(Z, Y) \neq \operatorname{Pr}_{\text {test }}(Z, Y) \rightarrow \operatorname{Pr}_{\text {train }}(Z) \neq \operatorname{Pr}_{\text {test }}(Z)$
This paper focuses on the shift between the distributions of the training and test representations $Z$.
To measure this shift, metrics such as MMD [8] or CMD [37] can be used. CMD measures the distance between two distributions $p$ and $q$ directly, as follows:
$\mathrm{CMD}=\frac{1}{|b-a|}\|\mathrm{E}(p)-\mathrm{E}(q)\|_{2}+\sum\limits _{k=2}^{\infty} \frac{1}{|b-a|^{k}}\left\|c_{k}(p)-c_{k}(q)\right\|_{2}$
where
- $c_{k}$ denotes the $k$-th order central moment; in practice the infinite sum is truncated, typically at $k=5$;
- $a$, $b$ denote the joint support of the distributions.
The larger the value above, the greater the distance between the two domains.
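For concreteness, here is a minimal PyTorch sketch of the (truncated) CMD metric between two sample matrices; the function name `cmd`, the truncation at the 5th moment, and the assumption that representations lie in a bounded interval $[a, b]$ are illustrative choices, not code from the paper.

```python
import torch

def cmd(p_samples: torch.Tensor, q_samples: torch.Tensor,
        K: int = 5, a: float = 0.0, b: float = 1.0) -> torch.Tensor:
    """Truncated Central Moment Discrepancy between two (num_samples x dim)
    matrices whose entries are assumed to lie in [a, b]."""
    scale = abs(b - a)
    mean_p, mean_q = p_samples.mean(dim=0), q_samples.mean(dim=0)
    # first-order term: distance between the means
    dist = torch.norm(mean_p - mean_q, p=2) / scale
    centered_p, centered_q = p_samples - mean_p, q_samples - mean_q
    # higher-order central moments, truncated at order K (K = 5 in the text)
    for k in range(2, K + 1):
        ck_p = (centered_p ** k).mean(dim=0)
        ck_q = (centered_q ** k).mean(dim=0)
        dist = dist + torch.norm(ck_p - ck_q, p=2) / (scale ** k)
    return dist
```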
A plain feed-forward layer can be written as $H^{k}=\sigma\left(H^{k-1} \theta^{k}\right)$, whereas a conventional GNN layer is $H^{k}=\sigma\left(\tilde{A} H^{k-1} \theta^{k}\right)$, where $\tilde{A}$ is the normalized adjacency matrix.
Because conventional GNNs multiply by the normalized adjacency matrix, they carry a graph inductive bias that changes the distribution of the learned representations. In semi-supervised learning, the combination of this graph-induced bias and biased sampling of the labeled nodes shifts the feature representations, so cheaply collected (biased) training samples can cause a large degradation in performance.
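For intuition, a minimal dense-matrix sketch contrasting the two layer types above (real implementations use sparse adjacency matrices; the function names and the ReLU choice are illustrative):

```python
import torch
import torch.nn.functional as F

def mlp_layer(H: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # H^k = sigma(H^{k-1} theta^k): no graph structure involved
    return F.relu(H @ theta)

def gnn_layer(A_norm: torch.Tensor, H: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # H^k = sigma(A_norm H^{k-1} theta^k): neighborhood aggregation via A_norm
    # injects the graph inductive bias into every layer's representations
    return F.relu(A_norm @ H @ theta)
```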
Formally, distributional shift is analyzed as follows:
Definition 3.1 (Distribution shift in GNNs). Assume node representations $Z=\left\{z_{1}, z_{2}, \ldots, z_{n}\right\}$ are given as an output of the last hidden layer of a graph neural network on graph $G$ with $n$ nodes. Given labeled data $\left\{\left(x_{i}, y_{i}\right)\right\}$ of size $M$, the labeled node representation $Z_{l}=\left(z_{1}, \ldots, z_{M}\right)$ is the subset of the nodes that are labeled, $Z_{l} \subset Z$. Assume $Z$ and $Z_{l}$ are drawn from two probability distributions $p$ and $q$. The distribution shift in GNNs is then measured via a distance metric $d\left(Z, Z_{l}\right)$.
Figure 1 shows that the distributional shift caused by biased sampling directly reduces model performance. Using a GCN model, it plots the distributional shift distance on three datasets (x-axis) against the corresponding model accuracy (y-axis).
The results show that GNN node-classification performance on these datasets is inversely related to the magnitude of the distributional shift, which motivates the study of distributional shift.
4 Shift-Robust Graph Neural Networks
This section first addresses the distributional shift problem ($\operatorname{Pr}_{\text {train }}(Z) \neq \operatorname{Pr}_{\text {test }}(Z)$) for two types of GNN models, and then proposes a general framework to mitigate it.
4.1 Scenario 1: Traditional GNN models
A traditional GNN model (e.g., GCN) $\Phi$ consists of a learnable function $\mathbf{F}$ with parameters $\Theta$ and the adjacency matrix $A$:
$\Phi=\mathbf{F}(\Theta, A)$
In GCN, the graph inductive bias is applied multiplicatively at every layer, and gradients back-propagate through all layers. The node representations produced by the last layer are:
$Z \equiv Z_{k}=\Phi\left(\Theta, Z_{k-1}, A\right)$, $Z_{k} \in[a, b]^{n}$, $Z_{0}=X$
The node representations of the training samples $\left\{x_{i}\right\}_{i=1}^{M}$ are denoted $Z_{\text {train }}=\left\{z_{i}\right\}_{i=1}^{M}$. To represent the test distribution, an unbiased IID sample set $X_{\text {IID }}=\left\{x_{i}^{\prime}\right\}_{i=1}^{M}$ is drawn from the unlabeled data, with output representations $Z_{\text {IID }}=\left\{z_{i}^{\prime}\right\}_{i=1}^{M}$.
To mitigate the distributional shift between training and test samples, the paper adds a regularizer $d:[a, b]^{n} \times[a, b]^{n} \rightarrow \mathbb{R}^{+}$ to the cross-entropy loss. Since $\Phi$ is fully differentiable, a distributional shift metric can serve as the regularizer, directly minimizing the discrepancy between the biased training samples and the unbiased IID samples:
$\mathcal{L}=\frac{1}{M} \sum_{i} l\left(y_{i}, z_{i}\right)+\lambda \cdot d\left(Z_{\text {train }}, Z_{\text {IID }}\right)$
Here the distributional shift is measured with the central moment discrepancy (CMD) regularizer $d_{\mathrm{CMD}}$ (a code sketch follows after the definitions below):
$d_{\mathrm{CMD}}\left(Z_{\text {train }}, Z_{\mathrm{IID}}\right)=\frac{1}{|b-a|}\left\|\mathbf{E}\left(Z_{\text {train }}\right)-\mathbf{E}\left(Z_{\mathrm{IID}}\right)\right\|+\sum\limits_{k=2}^{\infty} \frac{1}{|b-a|^{k}}\left\|c_{k}\left(Z_{\text {train }}\right)-c_{k}\left(Z_{\mathrm{IID}}\right)\right\|$
where
- $\mathbf{E}(Z)=\frac{1}{M} \sum_{i} z_{i}$ is the empirical expectation;
- $c_{k}(Z)=\mathbf{E}\left((Z-\mathbf{E}(Z))^{k}\right)$ is the $k$-th order central moment.
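Putting the pieces together, here is a minimal sketch of the Scenario 1 training objective. It assumes a model that returns both logits and the last hidden representations, and reuses the `cmd` function from the earlier sketch; both are my assumptions for illustration, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def scenario1_loss(model, X, A_norm, y, train_idx, iid_idx, lam=1.0):
    """Cross-entropy on the biased training nodes plus a CMD penalty between
    their representations and those of an unbiased IID node sample."""
    logits, Z = model(X, A_norm)           # assumed signature: (logits, hidden)
    ce = F.cross_entropy(logits[train_idx], y[train_idx])
    shift = cmd(Z[train_idx], Z[iid_idx])  # cmd() from the sketch above
    return ce + lam * shift
```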
4.2 Scenario 2: Linearized GNN Models
A linearized GNN model uses two different functions: one for nonlinear feature transformation and one for a linear graph-diffusion stage:
$\Phi=\mathbf{F}_{\mathbf{2}}(\underbrace{\mathbf{F}_{\mathbf{1}}(\mathbf{A})}_{\text {linear function }}, \Theta, X)$
Here the linear function $\mathbf{F}_{\mathbf{1}}$ combines the graph inductive bias with the node features, decoupled from the multi-layer neural network feature encoder $\mathbf{F}_{\mathbf{2}}$. In SimpleGCN [34], $\mathbf{F}_{\mathbf{1}}(A)=A^{k} X$. Another branch of linearized models [16, 4, 36] uses personalized PageRank to pre-compute information diffusion over the graph ($\mathbf{F}_{\mathbf{1}}(A)=\alpha(I-(1-\alpha) \tilde{A})^{-1}$) and applies it to the encoded node features $\mathbf{F}(\Theta, X)$.
In both models, the graph inductive bias enters as the input features of the linear function $\mathbf{F}_{\mathbf{1}}$. Unfortunately, since there are no learnable layers at this stage, the distributional regularizer proposed in the previous section cannot be applied directly. A sketch of the two forms of $\mathbf{F}_{\mathbf{1}}$ follows below.
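A small NumPy sketch of the two forms of $\mathbf{F}_{\mathbf{1}}$ mentioned above, using dense matrices for clarity (exact PPR inversion is only practical for small graphs; the function names are illustrative):

```python
import numpy as np

def simplegcn_features(A_norm: np.ndarray, X: np.ndarray, k: int = 2) -> np.ndarray:
    """SimpleGCN-style linear propagation: F1(A) X = A_norm^k X."""
    H = X
    for _ in range(k):
        H = A_norm @ H
    return H

def ppr_matrix(A_norm: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Personalized PageRank diffusion: alpha (I - (1 - alpha) A_norm)^{-1}.
    Row i is the PPR vector pi_i, used as the bias feature h_i."""
    n = A_norm.shape[0]
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * A_norm)
```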
In this case, the training and test samples can be viewed as row-level samples of $\mathbf{F}_{\mathbf{1}}$, and the distributional shift problem $\operatorname{Pr}_{\text {train }}(Z) \neq \operatorname{Pr}_{\text {test }}(Z)$ becomes one of matching the training and test graphs in the graph-inductive-bias feature space $h_{i} \in \mathbb{R}^{n}$. To generalize from training to test data, an instance-reweighting scheme can correct the bias so that the biased training samples $\left\{h_{i}\right\}_{i=1}^{M}$ resemble the IID samples $\left\{h_{i}^{\prime}\right\}_{i=1}^{M}$. The resulting cross-entropy loss is
$\mathcal{L}=\frac{1}{M} \sum\limits_{i=1}^{M} \beta_{i} l\left(y_{i}, \Phi\left(h_{i}\right)\right)$
where
- $\beta_{i}$ is the weight of each training instance;
- $l$ is the cross-entropy loss.
The optimal $\beta$ is then computed by solving a kernel mean matching (KMM) [9] problem:
$\min _{\beta_{i}}\left\|\frac{1}{M} \sum\limits_{i=1}^{M} \beta_{i} \psi\left(h_{i}\right)-\frac{1}{M^{\prime}} \sum\limits_{i=1}^{M^{\prime}} \psi\left(h_{i}^{\prime}\right)\right\|^{2} \text {, s.t. } B_{l} \leq \beta<B_{u}$
$\psi: \mathbb{R}^{n} \rightarrow \mathcal{H}$ denotes the feature map into the reproducing kernel Hilbert space (RKHS) induced by the kernel $k$. In the experiments, the authors use a mixture of Gaussian kernels $k(x, y)=\sum_{\alpha_{i}} \exp \left(-\alpha_{i}\|x-y\|_{2}\right)$ with $\alpha_{i}=1, 0.1, 0.01$. The lower-bound $B_{l}$ and upper-bound $B_{u}$ constraints ensure that most samples receive reasonable weights, rather than only a few samples getting non-zero weight.
The actual label space contains multiple classes. To prevent $\beta$ from introducing label imbalance, an additional constraint requires that, for every class $c$, the sum of $\beta$ stays the same before and after correction: $\sum_{i}^{M} \beta_{i} \cdot \mathbb{I}\left(l_{i}=c\right)=\sum_{i}^{M} \mathbb{I}\left(l_{i}=c\right), \forall c$. A sketch of this KMM reweighting step follows below.
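Here is a minimal sketch of the KMM step. The mixture-of-Gaussians kernel over squared distances, the bound-constrained L-BFGS-B solver, the default bounds, and the post-hoc per-class renormalization are my simplifying assumptions, not the authors' released code.

```python
import numpy as np
from scipy.optimize import minimize

def mixture_rbf(X, Y, alphas=(1.0, 0.1, 0.01)):
    """Mixture-of-Gaussians kernel (assumed form of the kernel in the text)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return sum(np.exp(-a * d2) for a in alphas)

def kmm_weights(H_train, H_iid, B_l=0.2, B_u=5.0):
    """Find beta so the weighted mean of the training rows matches the IID mean in RKHS."""
    M, M_iid = len(H_train), len(H_iid)
    K = mixture_rbf(H_train, H_train)                         # M x M Gram matrix
    kappa = mixture_rbf(H_train, H_iid).sum(1) * (M / M_iid)  # cross-term, length M
    # expanding the squared RKHS distance gives the objective beta^T K beta - 2 kappa^T beta
    def obj(b):
        return b @ K @ b - 2 * kappa @ b
    def grad(b):
        return 2 * (K @ b - kappa)
    res = minimize(obj, x0=np.ones(M), jac=grad,
                   bounds=[(B_l, B_u)] * M, method="L-BFGS-B")
    return res.x

def renormalize_per_class(beta, labels):
    """Post-hoc approximation of the per-class constraint: rescale beta within
    each class so its sum equals the number of samples in that class."""
    beta = beta.copy()
    for c in np.unique(labels):
        mask = labels == c
        beta[mask] *= mask.sum() / beta[mask].sum()
    return beta
```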
4.3 Shift-Robust GNN Framework
We now present Shift-Robust GNN (SR-GNN), a general training objective that addresses distributional shift in GNNs:
$\mathcal{L}_{\text {SR-GNN }}=\frac{1}{M} \sum\limits_{i=1}^{M} \beta_{i} l\left(y_{i}, \Phi\left(x_{i}, A\right)\right)+\lambda \cdot d\left(Z_{\text {train }}, Z_{\text {IID }}\right)$
The framework consists of a regularization component that handles distributional shift in the learnable layers (Section 4.1) and an instance-reweighting component that handles the case where the graph inductive bias is applied after feature encoding (Section 4.2).
We now discuss a concrete instantiation of the framework for the APPNP [16] model, which is defined as:
$\Phi_{\text {APPNP }}=\underbrace{\left((1-\alpha)^{k} \tilde{A}^{k}+\alpha \sum\limits_{i=0}^{k-1}(1-\alpha)^{i} \tilde{A}^{i}\right)}_{\text {approximated personalized page rank }} \underbrace{\mathbf{F}(\Theta, X)}_{\text {feature encoder }}$
First, the feature encoder $\mathbf{F}$ is applied to the node features $X$, followed by the linearly approximated personalized PageRank matrix. We therefore take $h_{i}=\pi_{i}^{\mathrm{ppr}}$, where $\pi_{i}^{\mathrm{ppr}}$ is the personalized PageRank vector of node $i$, and use instance reweighting to reduce the distributional shift caused by the graph inductive bias. In addition, letting $Z=\mathbf{F}(\Theta, X)$, the distributional regularizer $d$ can further reduce the shift inside the nonlinear network. In the experiments, SR-GNN is also applied to two other representative GNN models: GCN [15] and DGI [32]. A sketch of this APPNP instantiation follows below.
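Combining both components for the APPNP instantiation, a minimal sketch; treating the encoder as a plain MLP over features, using a precomputed dense PPR matrix, and taking per-instance weights `beta` from the KMM sketch above are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def sr_gnn_appnp_loss(encoder, ppr, X, y, train_idx, iid_idx, beta, lam=1.0):
    """Instance-reweighted cross-entropy plus CMD regularization on the
    encoder output Z = F(Theta, X) for an APPNP-style model."""
    Z = encoder(X)                  # feature encoder F(Theta, X)
    logits = ppr @ Z                # approximate personalized PageRank propagation
    ce = F.cross_entropy(logits[train_idx], y[train_idx], reduction="none")
    # beta: tensor of per-training-instance weights aligned with train_idx
    weighted_ce = (beta * ce).mean()
    shift = cmd(Z[train_idx], Z[iid_idx])  # cmd() from the earlier sketch
    return weighted_ce + lam * shift
```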
5 Experiments
6 Conclusion
For semi-supervised learning on graphs, the consistency between the training and test representation distributions should be taken into account.
Revision history
2022-06-24: Article created