Paper information
Paper title: Shift-Robust GNNs: Overcoming the Limitations of Localized Graph Training Data
Authors: Qi Zhu, Natalia Ponomareva, Jiawei Han, Bryan Perozzi
Venue: NeurIPS 2021
1 Introduction
Semi-supervised learning on graphs uses the relationships between data points (i.e., edge connections, which carry an inductive bias) together with a set of labeled samples to predict the labels of the remaining nodes.
The problem addressed here: when the distribution of the training set differs from that of the test set, the model tends to overfit and generalize poorly. When only a small labeled subset of the data is selected for training, this problem is especially pronounced.
Concretely, the contributions are as follows:
1. We provide the first focused discussion on the distributional shift problem in GNNs.
2. We propose a generalized framework, Shift-Robust GNN (SR-GNN), which can address shift in both shallow and deep GNNs.
3. We create an experimental framework which allows for creating biased train/test sets for graph learning datasets.
4. We run extensive experiments and analyze the results, proving that our methods can mitigate distributional shift.
2 Related Work
Standard learning theory assumes that training and inference data come from the same distribution, but in many practical settings this does not hold. In transfer learning, domain adaptation deals with transferring knowledge from a source domain (used for learning) to a target domain (the distribution on which inference is ultimately performed).
As a pioneering work in this area, [3] uses a distance metric between the source and target domains to quantify their similarity. To obtain the final model, an intuitive idea is to train on a weighted combination of source and target data, where the weights are a function of the domain distance.
3 Distributional shift in GNNs
A semi-supervised node classifier is usually trained with the cross-entropy loss $l$:
$\mathcal{L}=\frac{1}{M} \sum\limits_{i=1}^{M} l\left(y_{i}, z_{i}\right)$
When training and test data come from the same distribution, i.e. $\operatorname{Pr}_{\text {train }}(X, Y)=\operatorname{Pr}_{\text {test }}(X, Y)$, the trained classifier performs well.
3.1 Data shift as representation shift
Under the standard learning-theory assumption $\operatorname{Pr}_{\text {train }}(Y \mid Z)=\operatorname{Pr}_{\text {test }}(Y \mid Z)$, the main cause of distributional shift is representation shift, i.e.
$\operatorname{Pr}_{\text {train }}(Z, Y) \neq \operatorname{Pr}_{\text {test }}(Z, Y) \rightarrow \operatorname{Pr}_{\text {train }}(Z) \neq \operatorname{Pr}_{\text {test }}(Z)$
This paper focuses on the shift between the distributions of the training and test representations $Z$.
To measure this shift, metrics such as MMD [8] or CMD [37] can be used. CMD measures the distance between two distributions $p$ and $q$ directly, as follows:
$\mathrm{CMD}=\frac{1}{|b-a|}\|\mathrm{E}(p)-\mathrm{E}(q)\|_{2}+\sum\limits _{k=2}^{\infty} \frac{1}{|b-a|^{k}}\left\|c_{k}(p)-c_{k}(q)\right\|_{2}$
where
- $c_{k}$ denotes the $k$-th order central moment; in practice the infinite sum is truncated, typically at $k=5$;
- $a$, $b$ denote the joint support of the distributions.
The larger the value above, the greater the distance between the two domains.
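For concreteness, here is a minimal PyTorch sketch of the (truncated) CMD metric between two sample matrices; the function name `cmd`, the truncation at the 5th moment, and the assumption that representations lie in a bounded interval $[a, b]$ are illustrative choices, not code from the paper.

```python
import torch

def cmd(p_samples: torch.Tensor, q_samples: torch.Tensor,
        K: int = 5, a: float = 0.0, b: float = 1.0) -> torch.Tensor:
    """Truncated Central Moment Discrepancy between two (num_samples x dim)
    matrices whose entries are assumed to lie in [a, b]."""
    scale = abs(b - a)
    mean_p, mean_q = p_samples.mean(dim=0), q_samples.mean(dim=0)
    # first-order term: distance between the means
    dist = torch.norm(mean_p - mean_q, p=2) / scale
    centered_p, centered_q = p_samples - mean_p, q_samples - mean_q
    # higher-order central moments, truncated at order K (K = 5 in the text)
    for k in range(2, K + 1):
        ck_p = (centered_p ** k).mean(dim=0)
        ck_q = (centered_q ** k).mean(dim=0)
        dist = dist + torch.norm(ck_p - ck_q, p=2) / (scale ** k)
    return dist
```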
A plain feed-forward layer can be written as $H^{k}=\sigma\left(H^{k-1} \theta^{k}\right)$, whereas a conventional GNN layer is $H^{k}=\sigma\left(\tilde{A} H^{k-1} \theta^{k}\right)$, where $\tilde{A}$ is the normalized adjacency matrix.
Because conventional GNNs multiply by the normalized adjacency matrix, they carry a graph inductive bias that changes the distribution of the learned representations. In semi-supervised learning, the combination of this graph-induced bias and biased sampling of the labeled nodes shifts the feature representations, so cheaply collected (biased) training samples can cause a large degradation in performance.
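For intuition, a minimal dense-matrix sketch contrasting the two layer types above (real implementations use sparse adjacency matrices; the function names and the ReLU choice are illustrative):

```python
import torch
import torch.nn.functional as F

def mlp_layer(H: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # H^k = sigma(H^{k-1} theta^k): no graph structure involved
    return F.relu(H @ theta)

def gnn_layer(A_norm: torch.Tensor, H: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # H^k = sigma(A_norm H^{k-1} theta^k): neighborhood aggregation via A_norm
    # injects the graph inductive bias into every layer's representations
    return F.relu(A_norm @ H @ theta)
```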
Formally, distributional shift is analyzed as follows:
Definition 3.1 (Distribution shift in GNNs). Assume node representations $Z=\left\{z_{1}, z_{2}, \ldots, z_{n}\right\}$ are given as an output of the last hidden layer of a graph neural network on graph $G$ with $n$ nodes. Given labeled data $\left\{\left(x_{i}, y_{i}\right)\right\}$ of size $M$, the labeled node representation $Z_{l}=\left(z_{1}, \ldots, z_{M}\right)$ is the subset of the nodes that are labeled, $Z_{l} \subset Z$. Assume $Z$ and $Z_{l}$ are drawn from two probability distributions $p$ and $q$. The distribution shift in GNNs is then measured via a distance metric $d\left(Z, Z_{l}\right)$.
Figure 1 shows that the distributional shift caused by biased sampling directly reduces model performance. Using a GCN model, it plots the distributional shift distance on three datasets (x-axis) against the corresponding model accuracy (y-axis).
The results show that GNN node-classification performance on these datasets is inversely related to the magnitude of the distributional shift, which motivates the study of distributional shift.
4 Shift-Robust Graph Neural Networks
This section first addresses the distributional shift problem ($\operatorname{Pr}_{\text {train }}(Z) \neq \operatorname{Pr}_{\text {test }}(Z)$) for two types of GNN models, and then proposes a general framework to mitigate it.
4.1 Scenario 1: Traditional GNN models
A traditional GNN model (e.g., GCN) $\Phi$ consists of a learnable function $\mathbf{F}$ with parameters $\Theta$ and the adjacency matrix $A$:
$\Phi=\mathbf{F}(\Theta, A)$
In GCN, the graph inductive bias is applied multiplicatively at every layer, and gradients back-propagate through all layers. The node representations produced by the last layer are:
$Z \equiv Z_{k}=\Phi\left(\Theta, Z_{k-1}, A\right)$, $Z_{k} \in[a, b]^{n}$, $Z_{0}=X$
The node representations of the training samples $\left\{x_{i}\right\}_{i=1}^{M}$ are denoted $Z_{\text {train }}=\left\{z_{i}\right\}_{i=1}^{M}$. To represent the test distribution, an unbiased IID sample set $X_{\text {IID }}=\left\{x_{i}^{\prime}\right\}_{i=1}^{M}$ is drawn from the unlabeled data, with output representations $Z_{\text {IID }}=\left\{z_{i}^{\prime}\right\}_{i=1}^{M}$.
To mitigate the distributional shift between training and test samples, the paper adds a regularizer $d:[a, b]^{n} \times[a, b]^{n} \rightarrow \mathbb{R}^{+}$ to the cross-entropy loss. Since $\Phi$ is fully differentiable, a distributional shift metric can serve as the regularizer, directly minimizing the discrepancy between the biased training samples and the unbiased IID samples:
$\mathcal{L}=\frac{1}{M} \sum_{i} l\left(y_{i}, z_{i}\right)+\lambda \cdot d\left(Z_{\text {train }}, Z_{\text {IID }}\right)$
Here the distributional shift is measured with the central moment discrepancy (CMD) regularizer $d_{\mathrm{CMD}}$ (a code sketch follows after the definitions below):
$d_{\mathrm{CMD}}\left(Z_{\text {train }}, Z_{\mathrm{IID}}\right)=\frac{1}{|b-a|}\left\|\mathbf{E}\left(Z_{\text {train }}\right)-\mathbf{E}\left(Z_{\mathrm{IID}}\right)\right\|+\sum\limits_{k=2}^{\infty} \frac{1}{|b-a|^{k}}\left\|c_{k}\left(Z_{\text {train }}\right)-c_{k}\left(Z_{\mathrm{IID}}\right)\right\|$
where
- $\mathbf{E}(Z)=\frac{1}{M} \sum_{i} z_{i}$ is the empirical expectation;
- $c_{k}(Z)=\mathbf{E}\left((Z-\mathbf{E}(Z))^{k}\right)$ is the $k$-th order central moment.
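Putting the pieces together, here is a minimal sketch of the Scenario 1 training objective. It assumes a model that returns both logits and the last hidden representations, and reuses the `cmd` function from the earlier sketch; both are my assumptions for illustration, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def scenario1_loss(model, X, A_norm, y, train_idx, iid_idx, lam=1.0):
    """Cross-entropy on the biased training nodes plus a CMD penalty between
    their representations and those of an unbiased IID node sample."""
    logits, Z = model(X, A_norm)           # assumed signature: (logits, hidden)
    ce = F.cross_entropy(logits[train_idx], y[train_idx])
    shift = cmd(Z[train_idx], Z[iid_idx])  # cmd() from the sketch above
    return ce + lam * shift
```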
4.2 Scenario 2: Linearized GNN Models
A linearized GNN model uses two different functions: one for nonlinear feature transformation and one for a linear graph-diffusion stage:
$\Phi=\mathbf{F}_{\mathbf{2}}(\underbrace{\mathbf{F}_{\mathbf{1}}(\mathbf{A})}_{\text {linear function }}, \Theta, X)$
Here the linear function $\mathbf{F}_{\mathbf{1}}$ combines the graph inductive bias with the node features, decoupled from the multi-layer neural network feature encoder $\mathbf{F}_{\mathbf{2}}$. In SimpleGCN [34], $\mathbf{F}_{\mathbf{1}}(A)=A^{k} X$. Another branch of linearized models [16, 4, 36] uses personalized PageRank to pre-compute information diffusion over the graph ($\mathbf{F}_{\mathbf{1}}(A)=\alpha(I-(1-\alpha) \tilde{A})^{-1}$) and applies it to the encoded node features $\mathbf{F}(\Theta, X)$.
In both models, the graph inductive bias enters as the input features of the linear function $\mathbf{F}_{\mathbf{1}}$. Unfortunately, since there are no learnable layers at this stage, the distributional regularizer proposed in the previous section cannot be applied directly. A sketch of the two forms of $\mathbf{F}_{\mathbf{1}}$ follows below.
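A small NumPy sketch of the two forms of $\mathbf{F}_{\mathbf{1}}$ mentioned above, using dense matrices for clarity (exact PPR inversion is only practical for small graphs; the function names are illustrative):

```python
import numpy as np

def simplegcn_features(A_norm: np.ndarray, X: np.ndarray, k: int = 2) -> np.ndarray:
    """SimpleGCN-style linear propagation: F1(A) X = A_norm^k X."""
    H = X
    for _ in range(k):
        H = A_norm @ H
    return H

def ppr_matrix(A_norm: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Personalized PageRank diffusion: alpha (I - (1 - alpha) A_norm)^{-1}.
    Row i is the PPR vector pi_i, used as the bias feature h_i."""
    n = A_norm.shape[0]
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * A_norm)
```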
In this case, the training and test samples can be viewed as row-level samples of $\mathbf{F}_{\mathbf{1}}$, and the distributional shift problem $\operatorname{Pr}_{\text {train }}(Z) \neq \operatorname{Pr}_{\text {test }}(Z)$ becomes one of matching the training and test graphs in the graph-inductive-bias feature space $h_{i} \in \mathbb{R}^{n}$. To generalize from training to test data, an instance-reweighting scheme can correct the bias so that the biased training samples $\left\{h_{i}\right\}_{i=1}^{M}$ resemble the IID samples $\left\{h_{i}^{\prime}\right\}_{i=1}^{M}$. The resulting cross-entropy loss is
$\mathcal{L}=\frac{1}{M} \sum\limits_{i=1}^{M} \beta_{i} l\left(y_{i}, \Phi\left(h_{i}\right)\right)$
where
- $\beta_{i}$ is the weight of each training instance;
- $l$ is the cross-entropy loss.
The optimal $\beta$ is then computed by solving a kernel mean matching (KMM) [9] problem:
$\min _{\beta_{i}}\left\|\frac{1}{M} \sum\limits_{i=1}^{M} \beta_{i} \psi\left(h_{i}\right)-\frac{1}{M^{\prime}} \sum\limits_{i=1}^{M^{\prime}} \psi\left(h_{i}^{\prime}\right)\right\|^{2} \text {, s.t. } B_{l} \leq \beta<B_{u}$
$\psi: \mathbb{R}^{n} \rightarrow \mathcal{H}$ denotes the feature map into the reproducing kernel Hilbert space (RKHS) induced by the kernel $k$. In the experiments, the authors use a mixture of Gaussian kernels $k(x, y)=\sum_{\alpha_{i}} \exp \left(-\alpha_{i}\|x-y\|_{2}\right)$ with $\alpha_{i}=1, 0.1, 0.01$. The lower-bound $B_{l}$ and upper-bound $B_{u}$ constraints ensure that most samples receive reasonable weights, rather than only a few samples getting non-zero weight.
The actual label space contains multiple classes. To prevent $\beta$ from introducing label imbalance, an additional constraint requires that, for every class $c$, the sum of $\beta$ stays the same before and after correction: $\sum_{i}^{M} \beta_{i} \cdot \mathbb{I}\left(l_{i}=c\right)=\sum_{i}^{M} \mathbb{I}\left(l_{i}=c\right), \forall c$. A sketch of this KMM reweighting step follows below.
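Here is a minimal sketch of the KMM step. The mixture-of-Gaussians kernel over squared distances, the bound-constrained L-BFGS-B solver, the default bounds, and the post-hoc per-class renormalization are my simplifying assumptions, not the authors' released code.

```python
import numpy as np
from scipy.optimize import minimize

def mixture_rbf(X, Y, alphas=(1.0, 0.1, 0.01)):
    """Mixture-of-Gaussians kernel (assumed form of the kernel in the text)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return sum(np.exp(-a * d2) for a in alphas)

def kmm_weights(H_train, H_iid, B_l=0.2, B_u=5.0):
    """Find beta so the weighted mean of the training rows matches the IID mean in RKHS."""
    M, M_iid = len(H_train), len(H_iid)
    K = mixture_rbf(H_train, H_train)                         # M x M Gram matrix
    kappa = mixture_rbf(H_train, H_iid).sum(1) * (M / M_iid)  # cross-term, length M
    # expanding the squared RKHS distance gives the objective beta^T K beta - 2 kappa^T beta
    def obj(b):
        return b @ K @ b - 2 * kappa @ b
    def grad(b):
        return 2 * (K @ b - kappa)
    res = minimize(obj, x0=np.ones(M), jac=grad,
                   bounds=[(B_l, B_u)] * M, method="L-BFGS-B")
    return res.x

def renormalize_per_class(beta, labels):
    """Post-hoc approximation of the per-class constraint: rescale beta within
    each class so its sum equals the number of samples in that class."""
    beta = beta.copy()
    for c in np.unique(labels):
        mask = labels == c
        beta[mask] *= mask.sum() / beta[mask].sum()
    return beta
```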
4.3 Shift-Robust GNN Framework
We now present Shift-Robust GNN (SR-GNN), a general training objective that addresses distributional shift in GNNs:
$\mathcal{L}_{\text {SR-GNN }}=\frac{1}{M} \sum\limits_{i=1}^{M} \beta_{i} l\left(y_{i}, \Phi\left(x_{i}, A\right)\right)+\lambda \cdot d\left(Z_{\text {train }}, Z_{\text {IID }}\right)$
The framework consists of a regularization component that handles distributional shift in the learnable layers (Section 4.1) and an instance-reweighting component that handles the case where the graph inductive bias is applied after feature encoding (Section 4.2).
We now discuss a concrete instantiation of the framework for the APPNP [16] model, which is defined as:
$\Phi_{\text {APPNP }}=\underbrace{\left((1-\alpha)^{k} \tilde{A}^{k}+\alpha \sum\limits_{i=0}^{k-1}(1-\alpha)^{i} \tilde{A}^{i}\right)}_{\text {approximated personalized page rank }} \underbrace{\mathbf{F}(\Theta, X)}_{\text {feature encoder }}$
First, the feature encoder $\mathbf{F}$ is applied to the node features $X$, followed by the linearly approximated personalized PageRank matrix. We therefore take $h_{i}=\pi_{i}^{\mathrm{ppr}}$, where $\pi_{i}^{\mathrm{ppr}}$ is the personalized PageRank vector of node $i$, and use instance reweighting to reduce the distributional shift caused by the graph inductive bias. In addition, letting $Z=\mathbf{F}(\Theta, X)$, the distributional regularizer $d$ can further reduce the shift inside the nonlinear network. In the experiments, SR-GNN is also applied to two other representative GNN models: GCN [15] and DGI [32]. A sketch of this APPNP instantiation follows below.
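Combining both components for the APPNP instantiation, a minimal sketch; treating the encoder as a plain MLP over features, using a precomputed dense PPR matrix, and taking per-instance weights `beta` from the KMM sketch above are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def sr_gnn_appnp_loss(encoder, ppr, X, y, train_idx, iid_idx, beta, lam=1.0):
    """Instance-reweighted cross-entropy plus CMD regularization on the
    encoder output Z = F(Theta, X) for an APPNP-style model."""
    Z = encoder(X)                  # feature encoder F(Theta, X)
    logits = ppr @ Z                # approximate personalized PageRank propagation
    ce = F.cross_entropy(logits[train_idx], y[train_idx], reduction="none")
    # beta: tensor of per-training-instance weights aligned with train_idx
    weighted_ce = (beta * ce).mean()
    shift = cmd(Z[train_idx], Z[iid_idx])  # cmd() from the earlier sketch
    return weighted_ce + lam * shift
```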
5 Experiments
6 Conclusion
For semi-supervised learning on graphs, the consistency between the training and test representation distributions should be taken into account.
Revision history
2022-06-24: Article created