
SR-GNN (Shift-Robust GNNs): Overcoming the Limitations of Localized Graph Training Data

2022-06-24 18:45:00 Learner-

Paper information

Paper title: Shift-Robust GNNs: Overcoming the Limitations of Localized Graph Training Data
Authors: Qi Zhu, Natalia Ponomareva, Jiawei Han, Bryan Perozzi
Venue: NeurIPS 2021
Paper link: download
Code: download

1 Introduction

   Semi-supervised learning uses the relationships between data points (i.e., the edge connections, which carry an inductive bias) together with a set of labeled samples to predict the labels of the remaining nodes.

   The problem in semi-supervised learning is that when the distributions of the training and test data are inconsistent, the model easily overfits and generalizes poorly. This is especially pronounced when only a small labeled subset of a large dataset is selected for training.

   Specifically, our contributions are as follows:

  1. We provide the first focused discussion on the distributional shift problem in GNNs.
  2. We propose a generalized framework, Shift-Robust GNN (SR-GNN), which can address shift in both shallow and deep GNNs.
  3. We create an experimental framework which allows for creating biased train/test sets for graph learning datasets.
  4. We run extensive experiments and analyze the results, proving that our methods can mitigate distributional shift.

2 Related Work

   Standard learning theory assumes that training and inference data come from the same distribution, but in many practical settings this does not hold. In transfer learning, the domain adaptation problem concerns transferring knowledge from a source domain (used for learning) to a target domain (the distribution on which inference is ultimately performed).

  [3] is a pioneering work in this area: it uses a distance metric between the source domain and the target domain to quantify the similarity of the two domains. To obtain the final model, an intuitive idea is to train on a weighted combination of source and target data, where the weights are a function of the domain distance.

3 Distributional shift in GNNs

  For an SSL classifier, the cross-entropy loss $l$ is usually used:

    $\mathcal{L}=\frac{1}{M} \sum\limits_{i=1}^{M} l\left(y_{i}, z_{i}\right)$

   When the training and test data come from the same distribution, i.e. $\operatorname{Pr}_{\text {train }}(X, Y)=\operatorname{Pr}_{\text {test }}(X, Y)$, the trained classifier performs well.

3.1 Data shift as representation shift

   Under the standard learning-theory assumption $\operatorname{Pr}_{\text {train }}(Y \mid Z)=\operatorname{Pr}_{\text {test }}(Y \mid Z)$, the main cause of distribution shift is representation shift, i.e.

    $\operatorname{Pr}_{\text {train }}(Z, Y) \neq \operatorname{Pr}_{\text {test }}(Z, Y) \rightarrow \operatorname{Pr}_{\text {train }}(Z) \neq \operatorname{Pr}_{\text {test }}(Z)$

   This paper focuses on the distribution shift between the representations $Z$ of the training set and the test set.

   To measure this shift, metrics such as MMD [8] or CMD [37] can be used. CMD measures the distance between distributions $p$ and $q$ directly, as follows:

    $\mathrm{CMD}=\frac{1}{|b-a|}\|\mathrm{E}(p)-\mathrm{E}(q)\|_{2}+\sum\limits _{k=2}^{\infty} \frac{1}{|b-a|^{k}}\left\|c_{k}(p)-c_{k}(q)\right\|_{2}$

   where

    • $c_{k}$ denotes the $k$-th order central moment; in practice the infinite sum is truncated, typically at $k=5$;
    • $a$ and $b$ denote the bounds of the joint distribution support of these distributions;

   The larger this value, the greater the distance between the two domains.
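
   As a concrete illustration, below is a minimal sketch (assuming PyTorch; the function name, the truncation at $k=5$, and the bounds $a, b$ are illustrative choices) of how the CMD distance between two sets of representation samples might be computed:

```python
import torch

def cmd_distance(p: torch.Tensor, q: torch.Tensor, n_moments: int = 5,
                 a: float = 0.0, b: float = 1.0) -> torch.Tensor:
    """Central Moment Discrepancy between two sample matrices of shape (N, d).

    a and b are the assumed bounds of the joint support; with a sigmoid-like
    bounded activation in the last hidden layer they would be 0 and 1.
    """
    scale = abs(b - a)
    mean_p, mean_q = p.mean(dim=0), q.mean(dim=0)
    # First term: distance between the means, scaled by 1 / |b - a|.
    dist = torch.norm(mean_p - mean_q, p=2) / scale
    centered_p, centered_q = p - mean_p, q - mean_q
    # Higher-order terms: k-th central moments, scaled by 1 / |b - a|^k.
    for k in range(2, n_moments + 1):
        moment_p = (centered_p ** k).mean(dim=0)
        moment_q = (centered_q ** k).mean(dim=0)
        dist = dist + torch.norm(moment_p - moment_q, p=2) / (scale ** k)
    return dist
```

   Truncating the infinite sum (here at $k=5$, as suggested above) is what makes the quantity computable in practice.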

   In this paper, a plain feed-forward (MLP) layer is written as $H^{k}=\sigma\left(H^{k-1} \theta^{k}\right)$, while a conventional GNN layer is $H^{k}=\sigma\left(\tilde{A} H^{k-1} \theta^{k}\right)$.

   Because conventional GNNs use the normalized adjacency matrix, they introduce a graph inductive bias that changes the distribution of the representations. Therefore, in semi-supervised learning, it is difficult to avoid a large performance penalty when training on cheaply collected (localized) samples, because graph induction and biased sampling shift the feature vectors.

   Formally, distribution shift is defined as follows:

  Definition 3.1 (Distribution shift in GNNs). Assume node representations  $Z=\left\{z_{1}, z_{2}, \ldots, z_{n}\right\}$  are given as an output of the last hidden layer of a graph neural network on graph  $G$  with $n$ nodes. Given labeled data  $\left\{\left(x_{i}, y_{i}\right)\right\}$  of size  $M$, the labeled node representation  $Z_{l}=\left(z_{1}, \ldots, z_{M}\right)$  is a subset of the nodes that are labeled,  $Z_{l} \subset Z$. Assume  $Z$  and  $Z_{l}$  are drawn from two probability distributions  $p$  and  $q$. The distribution shift in GNNs is then measured via a distance metric  $d\left(Z, Z_{l}\right)$.

  Figure 1 shows that the distribution shift caused by sample bias directly degrades model performance: using a GCN model, the distribution-shift distance (x-axis) is plotted against the corresponding model accuracy (y-axis) on three datasets.

  

   The results show that GNN node-classification accuracy on these datasets is inversely related to the magnitude of the distribution shift, which motivates the study of distribution shift.

4 Shift-Robust Graph Neural Networks

   This section first addresses the distribution-shift problem ($\operatorname{Pr}_{\text {train }}(Z) \neq \operatorname{Pr}_{\text {test }}(Z)$) for two types of GNN models, and then proposes a general framework that mitigates both cases.


4.1 Scenario 1: Traditional GNN models

   A traditional GNN model (e.g., GCN) $\Phi$ consists of a learnable function $\mathbf{F}$ with parameters $\Theta$ and the adjacency matrix $A$:

    $\Phi=\mathbf{F}(\Theta, A)$

   In a GCN, the graph inductive bias enters multiplicatively at every layer, and gradients back-propagate through all layers. The node representations produced by the last layer are:

    $Z \equiv Z_{k}=\Phi\left(\Theta, Z_{k-1}, A\right)$, $Z_{k} \in[a, b]^{n}$, $Z_{0}=X$

   The node representations of the training samples $\left\{x_{i}\right\}_{i=1}^{M}$ are denoted $Z_{\text {train }}=\left\{z_{i}\right\}_{i=1}^{M}$. For the test side, an unbiased IID sample set $X_{\text {IID }}=\left\{x_{i}^{\prime}\right\}_{i=1}^{M}$ is drawn from the unlabeled data, and its output representations are denoted $Z_{\text {IID }}=\left\{z_{i}^{\prime}\right\}_{i=1}^{M}$.

   To mitigate the distribution shift between the training and test samples, this paper adds a regularizer $d:[a, b]^{n} \times[a, b]^{n} \rightarrow \mathbb{R}^{+}$ to the cross-entropy loss. Because $\Phi$ is fully differentiable, the distribution-shift metric itself can serve as the regularizer, directly minimizing the discrepancy between the biased and the unbiased IID samples:

    $\mathcal{L}=\frac{1}{M} \sum_{i} l\left(y_{i}, z_{i}\right)+\lambda \cdot d\left(Z_{\text {train }}, Z_{\text {IID }}\right)$

   Here, the distribution shift is measured with the central moment discrepancy regularizer $d_{\mathrm{CMD}}$ (a sketch follows the definitions below):

    $d_{\mathrm{CMD}}\left(Z_{\text {train }}, Z_{\mathrm{IID}}\right)=\frac{1}{|b-a|}\left\|\mathbf{E}\left(Z_{\text {train }}\right)-\mathbf{E}\left(Z_{\mathrm{IID}}\right)\right\|_{2}+\sum\limits_{k=2}^{\infty} \frac{1}{|b-a|^{k}}\left\|c_{k}\left(Z_{\text {train }}\right)-c_{k}\left(Z_{\mathrm{IID}}\right)\right\|_{2}$

   where

    • $\mathbf{E}(Z)=\frac{1}{M} \sum_{i} z_{i}$ is the empirical expectation;
    • $c_{k}(Z)=\mathbf{E}\left((Z-\mathbf{E}(Z))^{k}\right)$ is the $k$-th order central moment;
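
   A minimal sketch of how this regularized objective could enter a training step (reusing the illustrative `cmd_distance` helper above; `model`, `train_idx`, and `iid_idx` are assumed names, and for brevity the same last-layer output is used both for the cross-entropy term and for the CMD term):

```python
import torch.nn.functional as F

def scenario1_loss(model, X, A_norm, y_train, train_idx, iid_idx, lam=1.0):
    # Forward pass of a fully differentiable GNN; Z holds the last-layer outputs.
    Z = model(X, A_norm)
    # Cross-entropy on the (biased) labeled nodes.
    ce = F.cross_entropy(Z[train_idx], y_train)
    # CMD regularizer between biased training nodes and unbiased IID nodes.
    reg = cmd_distance(Z[train_idx], Z[iid_idx])
    return ce + lam * reg
```

   In a deeper model the regularizer could instead be applied to an intermediate hidden layer rather than to the logits.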

4.2 Scenario 2: Linearized GNN Models

   A linearized GNN model uses two separate functions: one for nonlinear feature transformation, and another for the linear graph-propagation stage:

    $\Phi=\mathbf{F}_{\mathbf{2}}(\underbrace{\mathbf{F}_{\mathbf{1}}(\mathbf{A})}_{\text {linear function }}, \Theta, X)$

   Here the linear function $\mathbf{F}_{\mathbf{1}}$ folds the graph inductive bias into the node features, decoupled from the multilayer neural-network feature encoder $\mathbf{F}_{\mathbf{2}}$. In SimpleGCN [34], $\mathbf{F}_{\mathbf{1}}(A)=A^{k} X$. Another branch of linearized models [16, 4, 36] uses personalized PageRank to pre-compute the information diffusion in the graph ($\mathbf{F}_{\mathbf{1}}(A)=\alpha(I-(1-\alpha) \tilde{A})^{-1}$) and applies it to the encoded node features $\mathbf{F}(\Theta, X)$.
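
   As a sketch of the two kinds of linear graph functions $\mathbf{F}_{\mathbf{1}}$ mentioned above (assuming NumPy/SciPy, a symmetrically normalized adjacency with self-loops, and a graph small enough that the dense PPR inverse is affordable; all names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

def normalized_adjacency(A: sp.spmatrix) -> sp.spmatrix:
    """Symmetric normalization with self-loops: Ã = D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + sp.eye(A.shape[0])
    deg = np.asarray(A_hat.sum(axis=1)).flatten()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def sgc_features(A: sp.spmatrix, X: np.ndarray, k: int = 2) -> np.ndarray:
    """SimpleGCN-style precomputation Ã^k X: the graph bias is folded into the features."""
    A_norm = normalized_adjacency(A)
    for _ in range(k):
        X = A_norm @ X
    return X

def ppr_matrix(A: sp.spmatrix, alpha: float = 0.1) -> np.ndarray:
    """Exact PPR diffusion α (I - (1-α) Ã)^{-1}; dense inverse, small graphs only."""
    A_norm = normalized_adjacency(A).toarray()
    n = A_norm.shape[0]
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * A_norm)
```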

   In both models, the graph inductive bias is supplied as input features to the linear function $\mathbf{F}_{\mathbf{1}}$. Unfortunately, because this stage contains no learnable layers, the distribution regularizer proposed in the previous section cannot be applied directly.

   In this case, the training and test samples can be viewed as row-level samples of $\mathbf{F}_{\mathbf{1}}$, and the distribution-shift problem $\operatorname{Pr}_{\text {train }}(Z) \neq \operatorname{Pr}_{\text {test }}(Z)$ becomes one of matching the graph-inductive-bias feature space $h_{i} \in \mathbb{R}^{n}$ of the training and test graphs. To generalize from the training data to the test data, an instance-reweighting scheme can be used to correct the bias, so that the biased training samples $\left\{h_{i}\right\}_{i=1}^{M}$ resemble the IID samples $\left\{h_{i}^{\prime}\right\}_{i=1}^{M}$. The resulting cross-entropy loss is

    $\mathcal{L}=\frac{1}{M} \sum\limits_{i=1}^{M} \beta_{i} l\left(y_{i}, \Phi\left(h_{i}\right)\right)$

   where

    • $\beta_{i}$ is the weight of each training instance;
    • $l$ is the cross-entropy loss;

   The optimal $\beta$ is then computed by solving a kernel mean matching (KMM) [9] problem:

    $\min _{\beta_{i}}\left\|\frac{1}{M} \sum\limits_{i=1}^{M} \beta_{i} \psi\left(h_{i}\right)-\frac{1}{M^{\prime}} \sum\limits_{i=1}^{M^{\prime}} \psi\left(h_{i}^{\prime}\right)\right\|^{2} \text {, s.t. } B_{l} \leq \beta<B_{u}$

  $\psi: \mathbb{R}^{n} \rightarrow \mathcal{H}$ denotes the feature map into the reproducing kernel Hilbert space (RKHS) induced by the kernel $k$. In the experiments, the authors use a mixture of Gaussian kernels $k(x, y)=\sum_{\alpha_{i}} \exp \left(-\alpha_{i}\|x-y\|_{2}\right)$ with $\alpha_{i}=1, 0.1, 0.01$. The lower-bound $B_{l}$ and upper-bound $B_{u}$ constraints ensure that most samples receive reasonable weights, rather than only a few samples receiving non-zero weight.

   In practice the label space contains multiple classes. To prevent $\beta$ from introducing label imbalance, it is further required that, for each class $c$, the total weight stays the same before and after correction: $\sum_{i}^{M} \beta_{i} \cdot \mathbb{I}\left(l_{i}=c\right)=\sum_{i}^{M} \mathbb{I}\left(l_{i}=c\right), \forall c$.
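
   Below is a minimal sketch of this weight computation (cvxpy is an assumed dependency, the RBF kernel form $\exp(-\alpha\|x-y\|_{2}^{2})$ is an assumption here, and all names are illustrative); it expands the KMM objective in kernel form and adds the bound and per-class constraints described above:

```python
import numpy as np
import cvxpy as cp

def kmm_weights(H_train, H_iid, labels, alphas=(1.0, 0.1, 0.01),
                B_l=0.1, B_u=10.0):
    """Kernel mean matching: weights β that align the biased training rows of F1
    with an unbiased IID sample, subject to bound and per-class-sum constraints."""
    def kernel(X, Y):
        # Mixture of RBF kernels exp(-α ||x - y||^2) over the given bandwidths.
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return sum(np.exp(-a * d2) for a in alphas)

    M = len(H_train)
    K = kernel(H_train, H_train) + 1e-8 * np.eye(M)   # jitter keeps K positive definite
    kappa = kernel(H_train, H_iid).mean(axis=1)        # (1/M') Σ_j k(h_i, h'_j)
    L = np.linalg.cholesky(K)                          # K = L L^T

    beta = cp.Variable(M)
    # || (1/M) Σ_i β_i ψ(h_i) - (1/M') Σ_j ψ(h'_j) ||^2 expanded in kernel form
    # (the constant term from the IID part is dropped).
    objective = cp.Minimize(cp.sum_squares(L.T @ beta) / M**2 - 2 * (kappa @ beta) / M)
    constraints = [beta >= B_l, beta <= B_u]
    # Keep the per-class total weight unchanged to avoid label imbalance.
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        constraints.append(cp.sum(beta[idx]) == len(idx))
    cp.Problem(objective, constraints).solve()
    return beta.value
```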

4.3 Shift-Robust GNN Framework

   We now propose Shift-Robust GNN (SR-GNN), a general training objective that addresses distribution shift in GNNs:

    $\mathcal{L}_{\text {SR-GNN }}=\frac{1}{M} \sum\limits_{i=1}^{M} \beta_{i} l\left(y_{i}, \Phi\left(x_{i}, A\right)\right)+\lambda \cdot d\left(Z_{\text {train }}, Z_{\text {IID }}\right)$

   The framework consists of a regularization component that handles distribution shift in the learnable layers (Section 4.1) and an instance-reweighting component that handles the case where the graph inductive bias enters the model outside the learnable feature encoder (Section 4.2).
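
   Putting the two components together, a sketch of one SR-GNN training step might look like this (reusing the illustrative `cmd_distance` and `kmm_weights` helpers above; `beta` is assumed to be precomputed once from the rows of $\mathbf{F}_{\mathbf{1}}$ and converted to a tensor):

```python
import torch
import torch.nn.functional as F

def sr_gnn_loss(model, X, A_norm, y_train, train_idx, iid_idx, beta, lam=1.0):
    Z = model(X, A_norm)                       # last-layer outputs of the GNN
    # Instance-reweighted cross-entropy, with β from kernel mean matching.
    per_node_ce = F.cross_entropy(Z[train_idx], y_train, reduction="none")
    weighted_ce = (beta * per_node_ce).mean()
    # CMD regularizer on the learnable representations.
    reg = cmd_distance(Z[train_idx], Z[iid_idx])
    return weighted_ce + lam * reg
```

   Setting $\beta_i = 1$ recovers the Scenario-1 objective for a traditional GNN, and setting $\lambda = 0$ recovers the Scenario-2 objective for a purely linearized model.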

   We now discuss a concrete instantiation of the framework applied to the APPNP [16] model. APPNP is defined as:

    $\Phi_{\text {APPNP }}=\underbrace{\left((1-\alpha)^{k} \tilde{A}^{k}+\alpha \sum\limits_{i=0}^{k-1}(1-\alpha)^{i} \tilde{A}^{i}\right)}_{\text {approximated personalized page rank }} \underbrace{\mathbf{F}(\Theta, X)}_{\text {feature encoder }}$

   APPNP first applies the feature encoder $\mathbf{F}$ to the node features $X$, and then applies the linearly approximated personalized PageRank matrix. We therefore take $h_{i}=\pi_{i}^{\mathrm{ppr}}$, where $\pi_{i}^{\mathrm{ppr}}$ is the personalized PageRank vector of node $i$, and use instance reweighting to reduce the distribution shift caused by this graph inductive bias. In addition, letting $Z=\mathbf{F}(\Theta, X)$, the regularizer $d$ can further reduce the distribution shift in the nonlinear network. In the experiments, SR-GNN is also applied to two other representative GNN models: GCN [15] and DGI [32].
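
   A minimal sketch of the APPNP-style forward pass (the power-iteration form of the approximate personalized PageRank above; `encoder` and `A_norm` are assumed names), showing where the two SR-GNN components would attach:

```python
def appnp_forward(encoder, X, A_norm, k=10, alpha=0.1):
    """Z = F(Θ, X), then k propagation steps H ← (1-α) Ã H + α Z, which expands to
    ((1-α)^k Ã^k + α Σ_{i<k} (1-α)^i Ã^i) Z, the approximate PPR diffusion."""
    Z = encoder(X)          # feature encoder output; the CMD regularizer acts on these rows
    H = Z
    for _ in range(k):
        H = (1 - alpha) * (A_norm @ H) + alpha * Z
    return H, Z             # H: diffused predictions, Z: encoder representations
```

   The rows $h_{i}=\pi_{i}^{\mathrm{ppr}}$ (e.g., from the dense `ppr_matrix` sketch earlier) would feed the KMM step to produce $\beta$, while $Z$ feeds the CMD term, as in the combined objective above.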

5 Experiments


6 Conclusion

   For semi-supervised learning on graphs, the consistency of the representation distributions between (biased) training data and test data should be taken into account.

Revision history

2022-06-24  Created article

