Paper information
Title: Towards Explanation for Unsupervised Graph-Level Representation Learning
Authors: Qinghua Zheng, Jihong Wang, Minnan Luo, Yaoliang Yu, Jundong Li, Lina Yao, Xiaojun Chang
Source: arXiv, 2022
Paper: download
Code: download
1 Introduction
This paper explains unsupervised graph-level representations using the information bottleneck principle.
2 Notations and preliminaries
2.1 Information Bottleneck
Given input data $X$ and its label $Y$, the Information Bottleneck (IB) principle seeks a compressed latent representation $Z$ that is maximally informative about $Y$. Formally, the latent representation $Z$ is learned by solving the following optimization problem:
$\underset{Z}{\max } \;\mathcal{L}_{I B}=I(Z ; Y)-\beta I(X ; Z)\quad\quad\quad(1)$
where $\beta$ controls the trade-off between informativeness and compression.
The mutual information (MI) $I(X ; Z)$ measures the dependence between two random variables and is defined as
$I(X ; Z)= \int_{x} \int_{z} p(x, z) \log \frac{p(x, z)}{p(x) p(z)} d x d z $
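To make the definition concrete, here is a minimal numeric sketch (my own illustration, not from the paper) that evaluates the discrete analogue of this integral for a toy joint distribution:

```python
import numpy as np

# Hypothetical joint distribution p(x, z) for two binary variables (an assumption
# for illustration only; any valid joint table works the same way).
p_xz = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xz.sum(axis=1, keepdims=True)   # marginal p(x), shape (2, 1)
p_z = p_xz.sum(axis=0, keepdims=True)   # marginal p(z), shape (1, 2)

# Discrete analogue of I(X; Z) = sum_{x,z} p(x,z) log[ p(x,z) / (p(x) p(z)) ]
mi = np.sum(p_xz * np.log(p_xz / (p_x * p_z)))
print(mi)  # ~0.19 nats: X and Z are correlated, so the MI is positive
```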
2.2 GNN explanation
GNN explanation aims to identify the intrinsic information of the graph that is crucial to the GNN's computation, so as to provide human-understandable explanations. Specifically, given a graph $G$ and a learned conditional distribution $P_{\psi}(\hat{Z} \mid G)$ (i.e., a GNN model with parameters $\psi$), GNN explanation aims to find the explanatory subgraph $S$ that is most relevant to the GNN's output, i.e.,
$\underset{S \in \mathcal{S}}{\text{arg max }} \operatorname{Score}(S, \hat{Z})\quad\quad\quad(2)$
where $\mathcal{S}$ denotes the set of all possible subgraphs of graph $G$, and $\operatorname{Score}(S, \hat{Z})$ measures the relevance between the subgraph $S$ and the GNN output $\hat{Z}$.
For example, GNNExplainer [9] focuses on explaining supervised GNNs and formalizes the relevance score $\operatorname{Score}(S, \hat{Z})$ as mutual information, i.e.,
$S=\arg \max _{S \in \mathcal{S}} I(S ; \hat{Y})$
where the random variable $\hat{Y}=\hat{Z}$ denotes the predicted classification probabilities.
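As a rough, hypothetical sketch (not GNNExplainer's actual implementation), this MI-based score can be approximated for a single graph by checking how well the prediction made on a candidate subgraph recovers the originally predicted class; the model and subgraph interfaces below are assumptions:

```python
import torch.nn.functional as F

def gnnexplainer_style_score(model, sub_x, sub_edge_index, y_hat):
    """Proxy for Score(S, Yhat) ~ I(S; Yhat): because H(Yhat) is fixed, maximizing
    the MI amounts to minimizing the cross-entropy between the prediction on the
    candidate subgraph S and the originally predicted class y_hat.
    `model`, `sub_x`, `sub_edge_index` are hypothetical interfaces for a graph
    classifier and a subgraph, not GNNExplainer's actual API."""
    log_probs = F.log_softmax(model(sub_x, sub_edge_index), dim=-1)  # shape (1, C)
    return log_probs[0, y_hat]  # higher = S better preserves the original prediction
```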
3 Method
3.1 Unsupervised Subgraph Information Bottleneck
In this paper, we study the previously unexplored problem of explaining unsupervised graph-level representation learning. Given a graph $G$ and its representation $Z$ extracted by an unsupervised GNN, our goal is to identify the explanatory subgraph $S$ that is most relevant to the representation.
Following previous explanation works [9,10], we use mutual information to measure relevance, so the explanation problem is formulated as $\underset{S}{\text{arg max }} I(S ; Z)$. Unfortunately, since $I(Z ; S) \leq I(Z ; G)$ (proof provided in Appendix B), this objective admits the trivial solution $S=G$. The trivial solution indicates that the explanatory subgraph may contain redundant information, e.g., noise and information irrelevant to the representation $Z$. Inspired by the success of the IB principle in explaining supervised networks [19], we generalize the IB principle to the unsupervised setting to avoid the trivial solution, which leads to a new principle.
Definition. (Unsupervised Subgraph Information Bottleneck: USIB). Given a graph $G$ and its representation $Z$ , the USIB seeks for the most informative yet compressed explanation $S$ through optimization problem
$ \underset{S}{\text{max } }\mathcal{L}_{U S I B}=I(Z ; S)-\beta I(G ; S)\quad\quad\quad(3)$
By optimizing the USIB objective, we trade off the informativeness and the compactness of the explanatory subgraph. However, optimizing the USIB objective is challenging because the mutual information terms involve integrals over high-dimensional data. Mutual information estimation methods are therefore required.
3.2 Optimization for USIB
We next discuss how to estimate the two terms of the USIB objective, $I(Z ; S)$ and $I(G ; S)$.
Maximizing $I(Z ; S)$
We adopt the Jensen-Shannon MI estimator [32,33] to obtain a tractable lower bound on $I(Z ; S)$, i.e.,
$\hat{I}^{J S D}(Z ; S):=\sup _{f_{\phi}} \mathbb{E}_{p(S, Z)}\left[-s p\left(-f_{\phi}(S, Z)\right)\right]-\mathbb{E}_{p(S), p(Z)}\left[s p\left(f_{\phi}(S, Z)\right)\right]\quad\quad\quad(4)$
where $sp(x)=\log \left(1+e^{x}\right)$ is the softplus function, and $f_{\phi}: \mathcal{S} \times \mathcal{Z} \rightarrow \mathbb{R}$ is a function with learnable parameters $\phi$ that discriminates whether a pair of instances of $S$ and $Z$ is sampled from the joint distribution. It is implemented as the composition of an $\mathrm{MLP}_{\phi_{1}}$ and a $\mathrm{GNN}_{\phi_{2}}$, i.e.,
$f_{\phi}\left(S^{(k)}, Z^{(k)}\right)=\operatorname{MLP}_{\phi_{1}}\left(\operatorname{GNN}_{\phi_{2}}\left(S^{(k)}\right) \| Z^{(k)}\right)\quad\quad\quad(5)$
where $\phi=\left\{\phi_{1}, \phi_{2}\right\}$ and $\|$ denotes the concatenation operator. Note that the distributions $p(S, Z)$ and $p(Z)$ are usually inaccessible in practice. Approximating them with Monte Carlo sampling, we obtain an approximation of the lower bound in Eq. 4:
$\underset{\phi}{max} \mathcal{L}_{1}(\phi, S)=\frac{1}{K} \sum\limits_{k=1}^{K}-s p\left(-f_{\phi}\left(S^{(k)}, Z^{(k)}\right)\right)-\frac{1}{K} \sum\limits_{k=1, m \neq k}^{K} s p\left(f_{\phi}\left(S^{(k)}, Z^{(m)}\right)\right)\quad\quad\quad(6)$
where $K$ is the number of samples; $\left(S^{(k)}, Z^{(k)}\right)$ are sampled from the joint distribution $p(S, Z)$, while $\left(S^{(k)}, Z^{(m)}\right)$ are sampled independently from the marginal distributions $p(S)$ and $p(Z)$. In practice, we obtain $\left(S^{(k)}, Z^{(m)}\right)$ by randomly permuting the pairs $\left(S^{(k)}, Z^{(k)}\right)$ sampled from the joint distribution.
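The following PyTorch sketch illustrates Eq. 5 and Eq. 6 under simplifying assumptions (it is not the authors' released code): the critic takes an already pooled subgraph embedding, and negative pairs are formed by permuting the batch of representations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    """Sketch of f_phi (Eq. 5): an MLP on the concatenation of a pooled subgraph
    embedding and the graph representation. The hidden size is an assumption."""
    def __init__(self, emb_dim, z_dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim + z_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s_emb, z):
        return self.mlp(torch.cat([s_emb, z], dim=-1)).squeeze(-1)


def jsd_mi_lower_bound(critic, s_emb, z):
    """Monte Carlo estimate of L1 in Eq. 6.
    s_emb: (K, d) pooled embeddings of the explanatory subgraphs S^(k), standing in
           for GNN_phi2(S^(k)) (the pooling step is a simplification here)
    z:     (K, d') graph representations Z^(k) produced by the target GNN
    """
    pos = critic(s_emb, z)                  # pairs (S^(k), Z^(k)) from the joint
    perm = torch.randperm(z.size(0))        # negatives (S^(k), Z^(m)) via permutation
    neg = critic(s_emb, z[perm])
    return (-F.softplus(-pos)).mean() - F.softplus(neg).mean()
```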
Minimizing $\boldsymbol{I}(\boldsymbol{G} ; \boldsymbol{S}) $
Note that the entropy of the explanatory subgraph, $H(S)=\mathbb{E}_{p(S)}[-\log p(S)]$, provides an upper bound on $I(G ; S)$, since $I(G ; S)=H(S)-H(S \mid G) \leq H(S)$. However, this entropy is hard to compute because the prior distribution of $S$ is unknown in practice. To address this, we consider a relaxation and assume that the explanatory subgraph is a Gilbert random graph [34], whose edges are conditionally independent of each other. Specifically, for each edge $(i, j) \in \mathcal{E}$ of graph $G$, let $e_{i, j} \sim \operatorname{Bernoulli}\left(\mu_{i, j}\right)$ be a binary variable indicating whether edge $(i, j)$ is selected into the subgraph $S$. The probability of a subgraph then factorizes as $p(S)=\prod\limits _{(i, j) \in \mathcal{E}} p\left(e_{i, j}\right)$, where $p\left(e_{i, j}\right)=\mu_{i, j}^{e_{i, j}}\left(1-\mu_{i, j}\right)^{1-e_{i, j}}$. With Monte Carlo sampling, we obtain an approximate upper bound of $I(G ; S)$, denoted as
$\mathcal{L}_{2}(S)=-\frac{1}{K} \sum\limits_{k=1}^{K} \sum\limits_{(i, j) \in \mathcal{E}} e_{i, j}^{(k)} \log \mu_{i, j}^{(k)}+\left(1-e_{i, j}^{(k)}\right) \log \left(1-\mu_{i, j}^{(k)}\right)\quad\quad\quad(7)$
Gradient-based optimization methods cannot directly optimize $\text{Eq.6}$ and $\text{Eq.7}$ due to the non-differentiable sampling process and the discrete nature of the subgraph structure. Therefore, we follow the Gumbel-Softmax reparameterization trick [35, 36] and relax the binary variable $e_{i, j}$ to a continuous edge weight $\hat{e}_{i, j}=\sigma((\log \epsilon-\log (1-\epsilon)+ \left.\left.w_{i, j}\right) / \tau\right) \in[0,1]$, where $\sigma(\cdot)$ is the sigmoid function; $\epsilon \sim \operatorname{Uniform}(0,1)$; $\tau$ is a temperature hyperparameter satisfying $\lim _{\tau \rightarrow 0} p\left(\hat{e}_{i, j}=1\right)=\sigma\left(w_{i, j}\right)$; and $w_{i, j}$ is a latent variable computed by a neural network, following previous works:
$w_{i, j}^{(k)}=\operatorname{MLP}_{\theta_{1}}\left(\mathbf{z}_{i}^{(k)} \| \mathbf{z}_{j}^{(k)}\right) \text { with } \mathbf{z}_{i}^{(k)}=\operatorname{GNN}_{\theta_{2}}\left(G^{(k)}, i\right), i=1,2, \cdots\quad\quad\quad(8)$
where $\mathbf{z}_{i}^{(k)}$ is the node representation of node $i$. For notational convenience, we denote $\theta= \left\{\theta_{1}, \theta_{2}\right\}$ and generate the relaxed subgraph $\hat{S}$ via $\hat{S}^{(k)}=g_{\theta}\left(G^{(k)}\right)$. Setting $\mu_{i, j}^{(k)}=\sigma\left(w_{i, j}^{(k)}\right)$, $\text{Eq.7}$ can be rewritten as
$\mathcal{L}_{2}\left(g_{\theta}\left(G^{(k)}\right)\right)=-\frac{1}{K} \sum\limits_{k=1}^{K} \sum\limits_{(i, j) \in \mathcal{E}} \hat{e}_{i, j}^{(k)} \log \sigma\left(w_{i, j}^{(k)}\right)+\left(1-\hat{e}_{i, j}^{(k)}\right) \log \left(1-\sigma\left(w_{i, j}^{(k)}\right)\right)\quad\quad\quad(9)$
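Below is a hedged PyTorch sketch of the subgraph generator $g_{\theta}$ (Eq. 8 with the Gumbel-Softmax relaxation) and of the compression term in Eq. 9; `node_encoder`, the layer sizes, and the `(2, num_edges)` `edge_index` convention are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SubgraphGenerator(nn.Module):
    """Sketch of g_theta: score every edge from its endpoint embeddings (Eq. 8) and
    relax the Bernoulli edge variable with the Gumbel-Softmax trick. `node_encoder`
    stands in for GNN_theta2; layer sizes and the (2, num_edges) edge_index layout
    are assumptions, not the paper's exact configuration."""
    def __init__(self, node_encoder, hidden_dim):
        super().__init__()
        self.node_encoder = node_encoder                 # GNN_theta2
        self.edge_mlp = nn.Sequential(                   # MLP_theta1
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x, edge_index, tau=1.0):
        z = self.node_encoder(x, edge_index)             # z_i = GNN_theta2(G, i)
        src, dst = edge_index
        w = self.edge_mlp(torch.cat([z[src], z[dst]], dim=-1)).squeeze(-1)  # w_ij
        eps = torch.rand_like(w).clamp(1e-6, 1 - 1e-6)   # epsilon ~ Uniform(0, 1)
        e_hat = torch.sigmoid((torch.log(eps) - torch.log(1 - eps) + w) / tau)
        return w, e_hat                                  # logits and relaxed weights


def compression_loss(w, e_hat):
    """L2 in Eq. 9 for a single graph: a cross-entropy-style upper bound on I(G; S)
    (the average over K sampled graphs is omitted for brevity)."""
    mu = torch.sigmoid(w)                                # mu_ij = sigma(w_ij)
    per_edge = e_hat * torch.log(mu + 1e-8) + (1 - e_hat) * torch.log(1 - mu + 1e-8)
    return -per_edge.sum()
```

The relaxed weights keep edge sampling differentiable during training; the final explanation is read off from these weights, as described below.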
Putting everything together, we rewrite the USIB optimization problem in $\text{Eq.3}$ as
$\underset{\phi, \theta}{\text{max }} \mathcal{L}_{U S I B}(\phi, \theta, G)=\mathcal{L}_{1}\left(\phi, g_{\theta}\left(G^{(k)}\right)\right)-\beta * \mathcal{L}_{2}\left(g_{\theta}\left(G^{(k)}\right)\right)\quad\quad\quad(10)$
An overview of our approach is shown in Fig. 2. A neural network first generates the explanatory subgraph, and another network then estimates the mutual information between the explanatory subgraph and the graph representation. Finally, the subgraph generator and the mutual information estimator are optimized collaboratively. The final explanatory subgraph is obtained by selecting the edges with the top-$n$ edge weights $\hat{e}_{i, j}^{(k)}$. The detailed algorithm can be found in the appendix.
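Finally, a small sketch of how the pieces above could be combined into the USIB objective (Eq. 10) and how the final explanation could be read off via top-$n$ edge selection; this follows my reading of the paper rather than the released implementation.

```python
import torch

def usib_objective(mi_lower_bound, compression, beta):
    """Eq. 10: trade informativeness against compression. `mi_lower_bound` is L1 from
    the JSD-estimator sketch and `compression` is L2 from the sketch above; both are
    scalar tensors, and the result is maximized w.r.t. both phi and theta."""
    return mi_lower_bound - beta * compression


def top_n_edges(e_hat, edge_index, n):
    """Read off the explanation: keep the n edges with the largest relaxed weights."""
    idx = torch.topk(e_hat, n).indices
    return edge_index[:, idx]
```

In practice, the generator and the critic would be updated jointly by gradient ascent on this objective, with the target GNN kept frozen since it is the model being explained.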

4 Experiments
In this section, we empirically evaluate the effectiveness and superiority of the proposed method by answering the following research questions.
- RQ1: How does our proposed method perform compared with other baseline explainers?
- RQ2: Do the expressiveness and robustness of the representations affect the fidelity of the explanatory subgraphs, in agreement with the theoretical analysis?
4.1 Effectiveness of USIB


4.2 Influence of representations' expressiveness and robustness

4.3 Qualitative analysis

5 Conclusion
We study a previously unexplored explanation problem: explaining unsupervised graph-level representation learning. We extend the IB principle to this setting, which yields a novel explanation method, USIB. In addition, we theoretically analyze the relationship between the representations and the explanatory subgraphs on the label space, showing that the expressiveness and robustness of the representations benefit the fidelity of the explanatory subgraphs. Extensive results on four datasets and three target models demonstrate the superiority of our method and the validity of the theoretical analysis. As future directions, we will consider counterfactual explanation for unsupervised representation learning [42] and explore whether there is a connection between explanations and adversarial examples [43,44,45].
Revision history
2022-06-21: Article created
Reference
Graph theory: random graphs and random dot product graphs








