Detecting Adversarial Examples with Fisher Information: A Code Walkthrough
2022-06-21 16:01:00 【鬼道2022】
1 Introduction
The previous post,《Fisher信息量在对抗样本中的应用》(Applications of Fisher Information to Adversarial Examples), gave a thorough account of how Fisher information is used in adversarial attacks, defenses, and detection, and analyzed three representative papers. Fisher information is a very handy mathematical tool for probing the deeper causes of adversarial behavior in deep learning models. This post is an in-depth walkthrough of the code for《Inspecting adversarial examples using the Fisher information》, a paper that detects adversarial examples with three Fisher-information-based metrics: the trace of the Fisher information matrix, the Fisher information quadratic form, and the Fisher information sensitivity. Along the way, this post fills in intermediate derivations that the paper states without proof, and the key implementation details of the code are pointed out in the corresponding sections.
2 Trace of the Fisher Information Matrix
Given an input sample $x$, the neural network outputs a $C$-dimensional probability vector $f_\theta(x)=(f^c_\theta(x))_{c=1,\cdots,C}$. The Fisher information matrix with respect to the network parameters $\theta$, in its continuous form and its discrete form, is

$$\begin{aligned}\mathbb{F}_\theta&=\mathbb{E}_y[\nabla_\theta \log f(x;\theta)\,\nabla_\theta^\top \log f(x;\theta)]\\&=\int f(y|x;\theta)\,\nabla_\theta \log f(x;\theta)\, \nabla_\theta^{\top} \log f(x;\theta)\,dy\\&\approx \sum_{c=1}^C f_\theta^c(x)\cdot \nabla_\theta \log f^c_\theta(x)\, \nabla_\theta^{\top}\log f^c_\theta(x)\\&=\sum_{c=1}^C \nabla_\theta f_\theta^c(x)\, \nabla_\theta^\top \log f_\theta^c(x),\end{aligned}$$

where the expectation is taken under the model's predictive distribution (hence the density $f(y|x;\theta)$ in the integral) and the last step uses the identity $\nabla_\theta f^c_\theta(x)=f^c_\theta(x)\,\nabla_\theta\log f^c_\theta(x)$. Here $\nabla_\theta \log f(x;\theta)\in\mathbb{R}^{p\times C}$, $\nabla_\theta \log f^c_\theta(x)\in \mathbb{R}^{p\times 1}$, and $\mathbb{F}_\theta\in \mathbb{R}^{p\times p}$. Note that even for a very small network, the $O(p^2)$ cost of computing the full Fisher information matrix is already intractable, let alone for networks with hundreds of millions of parameters. Since the paper only aims to detect adversarial examples, the exact entries of the Fisher information matrix are not needed; a scalar summary of its magnitude for a given sample is enough to serve as a detection statistic. The paper therefore uses the trace of the Fisher information matrix, computed as follows.
$$\mathrm{tr}(\mathbb{F}_\theta)=\sum_{i=1}^p\sum_{c=1}^C \partial_{\theta_i} f^c_\theta(x)\,\partial_{\theta_i}\log f^c_\theta(x)$$

Theory and implementation always differ a little. The derivation above treats all the network's weights as one flat parameter vector, whereas in code the parameters are organized layer by layer; for the purpose of the Fisher trace the two views agree. Suppose a network has four parameter blocks $\theta_1,\theta_2,\theta_3,\theta_4 \in \mathbb{R}^{l\times 1}$; the stacked parameter vector and gradient are

$$\theta=\begin{pmatrix}\theta_1\\\theta_2\\\theta_3\\\theta_4\end{pmatrix}\in\mathbb{R}^{p\times1},\quad \nabla_{\theta}f_\theta^c(x)=\begin{pmatrix}\nabla_{\theta_1}f_{\theta}^c(x)\\\nabla_{\theta_2} f_{\theta}^c(x)\\\nabla_{\theta_3}f^c_{\theta}(x)\\\nabla_{\theta_4}f^c_{\theta}(x)\end{pmatrix}\in\mathbb{R}^{p\times 1},$$

and the trace is the same whichever way it is accumulated:

$$\begin{aligned}\mathrm{tr}(\mathbb{F}_\theta)&=\sum_{c=1}^C \nabla_\theta^{\top}f^c_\theta(x)\,\nabla_\theta \log f^c_\theta(x)\\&=\sum_{c=1}^C\sum_{k=1}^{4} \nabla_{\theta_k}^{\top}f^c_{\theta}(x)\,\nabla_{\theta_k}\log f_{\theta}^c(x)\\&=\sum_{i=1}^p\sum_{c=1}^C \partial_{\theta_i} f^c_\theta(x)\,\partial_{\theta_i}\log f^c_\theta(x).\end{aligned}$$

Computed this way with backpropagation, the trace costs $O(C\cdot p)$, far less than the $O(p^2)$ needed for the full Fisher information matrix.
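To make the flat-vector versus layer-wise equivalence concrete, here is a minimal sanity check (a sketch of my own, not from the paper's code; the toy model, seed, and shapes are arbitrary). It uses the identity $\mathrm{tr}(\mathbb{F}_\theta)=\sum_c f^c_\theta(x)\,\|\nabla_\theta\log f^c_\theta(x)\|^2$ and accumulates the trace both per layer and on the flattened gradient:

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 5), nn.Tanh(), nn.Linear(5, 3))  # toy classifier
x = torch.randn(1, 8)

probs = F.softmax(net(x), dim=1).detach()  # f_theta^c(x), shape (1, C)
trace_layerwise, trace_flat = 0.0, 0.0
for c in range(probs.shape[1]):
    # grad_theta log f_theta^c(x), returned as one tensor per parameter block
    grads = torch.autograd.grad(F.log_softmax(net(x), dim=1)[0, c],
                                list(net.parameters()))
    # layer-by-layer accumulation of ||grad_theta log f^c||^2
    trace_layerwise += probs[0, c] * sum((g ** 2).sum() for g in grads)
    # flat view: concatenate all blocks first, then take the squared norm
    flat = torch.cat([g.reshape(-1) for g in grads])
    trace_flat += probs[0, c] * flat.dot(flat)

print(trace_layerwise.item(), trace_flat.item())  # the two values coincide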
3 Fisher Information Quadratic Form
The trace of $\mathbb{F}_\theta$ can be written as $\sum_{i=1}^p e^{\top}_i \mathbb{F}_\theta e_i$, where $e_i$ is the unit vector whose $i$-th entry is $1$ and all other entries are $0$; this can be read as the average response of the $\mathrm{KL}$ divergence to a change in each individual parameter. Motivated by this, instead of averaging over a complete orthogonal basis, the authors pick a specific direction and metric, giving the quadratic form

$$v^{\top}\mathbb{F}_\theta v =\sum_{c=1}^C v^{\top}\nabla_\theta f_\theta^c(x) \cdot v^{\top}\nabla_\theta \log f_\theta^c(x),$$

where the direction $v$ depends on the parameters $\theta$ and the data point $(x,y)$:

$$v = \lambda \cdot \nabla_\theta \log p(y|x;\theta).$$

With $v$ normalized, the quadratic form becomes

$$\bar{v}^{\top}\mathbb{F}_\theta \bar{v} =\sum_{c=1}^C\frac{v^{\top}}{\|v\|}\nabla_\theta f_\theta^c(x)\,\frac{v^{\top}}{\|v\|}\nabla_\theta \log f_\theta^c(x).$$

Note that the choice of direction is not unique: the normalized quadratic form attains its maximum, the largest eigenvalue of the Fisher matrix, when the direction is the eigenvector belonging to that eigenvalue. It is also worth pointing out that the trace of the Fisher matrix upper-bounds its largest eigenvalue. Writing the eigendecomposition with an orthogonal matrix $Q$ and diagonal eigenvalue matrix $\Lambda$,

$$\mathrm{tr}(\Lambda)=\mathrm{tr}(Q\mathbb{F}_\theta Q^{-1})=\mathrm{tr}(\mathbb{F}_\theta Q^{-1}Q)=\mathrm{tr}(\mathbb{F}_\theta),$$

and since $\mathbb{F}_\theta$ is positive semi-definite, all its eigenvalues are non-negative, so their sum, the trace, is at least the largest one. In the actual implementation, to keep the computation cheap, the backpropagated gradients are approximated by finite differences. By Taylor's theorem,

$$f^c_{\theta+\varepsilon v}(x)=f^c_\theta(x)+\varepsilon v^{\top}\nabla_\theta f_\theta^c(x)+\mathcal{O}(\|\varepsilon v\|^2),$$

and therefore

$$v^{\top}\nabla_\theta f^c_\theta(x)\approx \frac{f^c_{\theta+\varepsilon v}(x)-f^c_{\theta}(x)}{\varepsilon}\approx \frac{f^c_{\theta+\varepsilon v}(x)-f^c_{\theta-\varepsilon v}(x)}{2 \varepsilon}.$$
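The central-difference estimate is easy to check against autograd. Below is a small sketch (again my own illustration, not the paper's code; the model, the direction $v$, and $\varepsilon$ are arbitrary) comparing $v^{\top}\nabla_\theta f^c_\theta(x)$ computed exactly by backpropagation with the two-sided finite difference:

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 5), nn.Tanh(), nn.Linear(5, 3))
x = torch.randn(1, 8)
eps, c = 1e-3, 0
params = list(net.parameters())
v = [torch.randn_like(p) for p in params]  # an arbitrary direction in parameter space

# Exact directional derivative v^T grad_theta f^c(x) via backpropagation
grads = torch.autograd.grad(F.softmax(net(x), dim=1)[0, c], params)
exact = sum((vi * gi).sum() for vi, gi in zip(v, grads))

# Central difference: shift all parameter blocks by +/- eps*v at once
with torch.no_grad():
    for p, vi in zip(params, v):
        p.add_(eps * vi)
    f_plus = F.softmax(net(x), dim=1)[0, c]
    for p, vi in zip(params, v):
        p.sub_(2 * eps * vi)
    f_minus = F.softmax(net(x), dim=1)[0, c]
    for p, vi in zip(params, v):
        p.add_(eps * vi)  # restore the original theta
approx = (f_plus - f_minus) / (2 * eps)

print(exact.item(), approx.item())  # agree up to O(eps^2)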
4 Fisher Information Sensitivity
To extract further usable Fisher information, the authors perturb the input with a single random variable $\xi \sim \mathcal{N}(0,1)$:

$$x^{\varepsilon,\eta}=x+\varepsilon \xi\cdot \eta,$$

where $\varepsilon>0$ and $\eta$ has the same dimension as $x$. The Fisher information matrix for this perturbed input $x^{\varepsilon,\eta}$ is
$$\mathbb{F}_\theta^{\varepsilon,\eta}=\sum_{c=1}^C\mathbb{E}_{x^{\varepsilon,\eta}}\left[\nabla_\theta f_\theta^c(x^{\varepsilon,\eta})\,\nabla^{\top}_\theta \log f_\theta^c(x^{\varepsilon,\eta})\right],$$

where $\mathbb{F}^{\varepsilon,\eta}_\theta\in\mathbb{R}^{p\times p}$, with entry $(i,j)$ given by

$$\left[\mathbb{F}_\theta^{\varepsilon,\eta}\right]_{(i,j)}=\sum_{c=1}^C\mathbb{E}_{x^{\varepsilon,\eta}}[H^c_{(i,j)}(x)]=\sum_{c=1}^C\mathbb{E}_{x^{\varepsilon,\eta}}\left[\partial_{\theta_i}f_{\theta}^c(x+\varepsilon \xi \eta)\cdot \partial_{\theta_j} \log f^c_{\theta}(x+\varepsilon \xi \eta)\right].$$

The corresponding entry of the unperturbed $\mathbb{F}_\theta \in \mathbb{R}^{p\times p}$ is

$$\left[\mathbb{F}_\theta\right]_{(i,j)}=\sum_{c=1}^C\mathbb{E}_{x^{\varepsilon,\eta}}[G_{(i,j)}^c(x)]=\sum_{c=1}^C\mathbb{E}_{x^{\varepsilon,\eta}}\left[\partial_{\theta_i} f^c_\theta(x)\,\partial_{\theta_j}\log f_{\theta}^c(x)\right].$$

A Taylor expansion in $\varepsilon$ then gives

$$\begin{aligned}H^c_{(i,j)}(x)&=\partial_{\theta_i} f_{\theta}^c(x+\varepsilon \xi \eta)\cdot \partial_{\theta_j}\log f^c_{\theta}(x+\varepsilon \xi \eta)\\&=G^c_{(i,j)}(x)+\varepsilon \xi\, \eta^{\top}\nabla_x G^c_{(i,j)}(x)+\frac{1}{2}\varepsilon^2 \xi^2\,\eta^{\top}\nabla_x^{\top}\nabla_xG^c_{(i,j)}(x)\,\eta+\mathcal{O}(\varepsilon^3),\end{aligned}$$

where the Hessian in the second-order term is

$$[\nabla_x^{\top}\nabla_xG^c_{(i,j)}(x)]_{(m,n)}=\partial_{x_m} \partial_{x_n}\left(\partial _{\theta_i} f_\theta^c(x)\,\partial_{\theta_j} \log f_\theta^c(x)\right).$$

Because $\xi$ has mean $0$ and variance $1$,

$$\mathbb{E}[\xi]=0,\quad \mathbb{E}[\xi^2]=\mathrm{Var}[\xi]+(\mathbb{E}[\xi])^2=1.$$

Putting these together,

$$\begin{aligned}[\mathbb{F}^{\varepsilon,\eta}_{\theta}]_{(i,j)}&=\sum_{c=1}^C \mathbb{E}_{x^{\varepsilon,\eta}}[H^c_{(i,j)}(x)]\\&=\sum_{c=1}^C G_{(i,j)}^c(x)+\varepsilon\, \mathbb{E}[\xi]\sum_{c=1}^C\eta^{\top}\nabla_x G^c_{(i,j)}(x)+\frac{1}{2}\varepsilon^2\,\mathbb{E}[\xi^2]\sum_{c=1}^C\eta^{\top}\nabla_x^{\top}\nabla_x G^{c}_{(i,j)}(x)\,\eta+\mathcal{O}(\varepsilon^3),\end{aligned}$$

which recovers the result stated in the paper:

$$\mathbb{F}^{\varepsilon,\eta}_\theta=\mathbb{F}_\theta+0+\frac{1}{2}\varepsilon^2 \sum_{c=1}^C\sum_{i,j=1}^N\eta_i\, \nabla_\theta \partial_{x_i}f_\theta^c(x)\,\nabla^{\top}_\theta \partial_{x_j}\log f^c_\theta(x)\,\eta_j + \mathcal{O}(\varepsilon^3).$$

As in the previous section, the authors also evaluate the quadratic form, now of the perturbed sample's Fisher matrix:

$$v^{\top}\mathbb{F}_\theta^{\varepsilon,\eta}v=v^{\top}\mathbb{F}_\theta v+\frac{1}{2}\varepsilon^2\, \eta^{\top} \delta_v \mathbb{F}_\theta\, \eta,$$

where

$$\delta_v\mathbb{F}_\theta = \sum_{c=1}^C \nabla_x (v^{\top}\nabla_\theta f_\theta^c(x))\cdot \nabla_x^{\top}(v^{\top} \nabla_\theta \log f^c_\theta(x)).$$

Now take the perturbation vector $\eta$ to be a unit vector $e_i$, for each $i=1,\cdots,N$ in turn. Approximating the backpropagated gradients by finite differences (the code in Section 5 uses the two-sided version), this yields

$$\begin{aligned}e_i^{\top}\delta_v \mathbb{F}_\theta e_i &=\sum_{c=1}^C \partial_{x_i}(v^{\top}\nabla_\theta f_\theta^c(x))\cdot \partial_{x_i}(v^{\top}\nabla_\theta \log f^c_\theta(x))\\&=\sum_{c=1}^C (v^{\top}\nabla_\theta \partial_{x_i}f^c_\theta(x))\cdot(v^{\top}\nabla_\theta \partial_{x_i} \log f^c_\theta (x))\\ &\approx \sum_{c=1}^C\frac{\partial_{x_i} f_{\theta +\varepsilon v}^c(x)-\partial_{x_i} f_\theta^c(x)}{\varepsilon}\cdot \frac{\partial_{x_i} \log f_{\theta+\varepsilon v}^c(x)-\partial_{x_i}\log f_{\theta}^c(x)}{\varepsilon}.\end{aligned}$$

This quantity is called the Fisher information sensitivity (FIS) in the paper; it measures the importance of the $i$-th input node.
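For completeness, a single $(c,i)$ term of $e_i^{\top}\delta_v\mathbb{F}_\theta e_i$ can also be computed exactly with double backpropagation instead of finite differences. The sketch below (my own illustration; the model and the index choices are arbitrary) differentiates with respect to $x_i$ first, keeps the graph, and then differentiates the result with respect to $\theta$:

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 5), nn.Tanh(), nn.Linear(5, 3))
x = torch.randn(1, 8, requires_grad=True)
params = list(net.parameters())
v = [torch.randn_like(p) for p in params]
c, i = 0, 3  # class index and input-node index

# v^T grad_theta d_{x_i} f^c(x): differentiate w.r.t. x with create_graph=True,
# then differentiate the resulting scalar w.r.t. the parameters
fx = F.softmax(net(x), dim=1)[0, c]
dfdx_i = torch.autograd.grad(fx, x, create_graph=True)[0][0, i]
term_f = sum((vi * g).sum() for vi, g in
             zip(v, torch.autograd.grad(dfdx_i, params)))

# The same mixed derivative for log f^c(x)
log_fx = F.log_softmax(net(x), dim=1)[0, c]
dlogdx_i = torch.autograd.grad(log_fx, x, create_graph=True)[0][0, i]
term_log = sum((vi * g).sum() for vi, g in
               zip(v, torch.autograd.grad(dlogdx_i, params)))

print((term_f * term_log).item())  # one (c, i) contribution to the FIS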
5 Code Example
The code for the Fisher matrix trace, the Fisher information quadratic form, and the Fisher information sensitivity is given below; read alongside the derivations above, the implementation details should be easy to follow. The first block is the module containing the operations (assumed to be saved as fisher.py, since the driver script imports it under that name); the second block is the driver script.
import torch
import torch.nn.functional as F
from copy import deepcopy
class FISHER_OPERATION(object):
    def __init__(self, input_data, network, vector, epsilon=1e-3):
        self.input = input_data
        self.network = network
        self.vector = vector  # one tensor per parameter block, shaped like network.parameters()
        self.epsilon = epsilon

    # Computes the Fisher matrix quadratic form along the given vector.
    # Each layer is perturbed separately, so this accumulates the diagonal
    # blocks v_i^T F_ii v_i (cross-layer blocks are not included).
    def fisher_quadratic_form(self):
        fisher_sum = 0
        ## Central finite differences, one parameter block (layer) at a time
        for i, parameter in enumerate(self.network.parameters()):
            ## Store the original parameters
            store_data = deepcopy(parameter.data)
            # theta + epsilon * v
            parameter.data += self.epsilon * self.vector[i]
            log_softmax_output1 = self.network(self.input)
            softmax_output1 = F.softmax(log_softmax_output1, dim=1)
            # theta - epsilon * v
            parameter.data -= 2 * self.epsilon * self.vector[i]
            log_softmax_output2 = self.network(self.input)
            softmax_output2 = F.softmax(log_softmax_output2, dim=1)
            # Restore the original parameters
            parameter.data = store_data
            # The summation of the finite-difference approximations of
            # (v^T grad_theta log f^c) * (v^T grad_theta f^c) over classes
            fisher_sum += (((log_softmax_output1 - log_softmax_output2) / (2 * self.epsilon))
                           * ((softmax_output1 - softmax_output2) / (2 * self.epsilon))).sum()
        return fisher_sum

    # Computes the Fisher matrix trace
    def fisher_trace(self):
        fisher_trace = 0
        output = self.network(self.input)
        output_dim = output.shape[1]
        ## Computes the gradient w.r.t. the parameters of each layer
        for parameter in self.network.parameters():
            for j in range(output_dim):
                self.network.zero_grad()
                log_softmax_output = self.network(self.input)
                log_softmax_output[0, j].backward()
                # Clone: otherwise zero_grad() below may wipe this tensor in place
                log_softmax_grad = parameter.grad.clone()
                self.network.zero_grad()
                softmax_output = F.softmax(self.network(self.input), dim=1)
                softmax_output[0, j].backward()
                softmax_grad = parameter.grad.clone()
                fisher_trace += (log_softmax_grad * softmax_grad).sum()
        return fisher_trace

    # Computes the Fisher information sensitivity for x and v.
    def fisher_sensitivity(self):
        output = self.network(self.input)
        output_dim = output.shape[1]
        x = deepcopy(self.input.data)
        x.requires_grad = True
        fisher_sum = 0
        for i, parameter in enumerate(self.network.parameters()):
            for j in range(output_dim):
                store_data = deepcopy(parameter.data)
                # plus eps
                parameter.data += self.epsilon * self.vector[i]
                log_softmax_output1 = self.network(x)
                log_softmax_output1[0, j].backward()
                new_plus_log_softmax_grad = deepcopy(x.grad.data)
                x.grad.zero_()
                self.network.zero_grad()
                softmax_output1 = F.softmax(self.network(x), dim=1)
                softmax_output1[0, j].backward()
                new_plus_softmax_grad = deepcopy(x.grad.data)
                x.grad.zero_()
                self.network.zero_grad()
                # minus eps
                parameter.data -= 2 * self.epsilon * self.vector[i]
                log_softmax_output2 = self.network(x)
                log_softmax_output2[0, j].backward()
                new_minus_log_softmax_grad = deepcopy(x.grad.data)
                x.grad.zero_()
                self.network.zero_grad()
                softmax_output2 = F.softmax(self.network(x), dim=1)
                softmax_output2[0, j].backward()
                new_minus_softmax_grad = deepcopy(x.grad.data)
                x.grad.zero_()
                self.network.zero_grad()
                # Reset the parameters and accumulate the central-difference
                # estimate of (v^T grad_theta d_x f^c) * (v^T grad_theta d_x log f^c)
                parameter.data = store_data
                fisher_sum += (1 / (2 * self.epsilon) ** 2
                               * (new_plus_log_softmax_grad - new_minus_log_softmax_grad)
                               * (new_plus_softmax_grad - new_minus_softmax_grad))
        return fisher_sum
import torch
import torch.nn as nn

import fisher  # the module defined above, saved as fisher.py

# A small log-softmax classifier: 15 input features, 3 classes
network = nn.Sequential(
    nn.Linear(15, 4),
    nn.Tanh(),
    nn.Linear(4, 3),
    nn.LogSoftmax(dim=1)
)
epsilon = 1e-3
input_data = torch.randn((1, 15))

# v = grad_theta log p(y|x; theta) at the predicted class: the network ends
# in LogSoftmax, so its maximal output is the predicted log-probability
network.zero_grad()
output = network(input_data).max()
output.backward()
vector = []
for parameter in network.parameters():
    vector.append(parameter.grad.clone())

FISHER = fisher.FISHER_OPERATION(input_data, network, vector, epsilon)
print("The fisher matrix quadratic form:", FISHER.fisher_quadratic_form())
print("The fisher matrix trace:", FISHER.fisher_trace())
print("The fisher information sensitivity:", FISHER.fisher_sensitivity())
