
[Out-of-Distribution Detection] Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One (ICLR '20)

2022-06-22 06:55:00 【chad_ lee】

https://arxiv.org/pdf/1912.03263v3.pdf

Commonly used classifiers model $p_{\theta}(y \mid \mathbf{x})$. This paper reinterprets the classifier from an energy-based perspective and obtains a hybrid of a generative model and a classification model. The hybrid models $p_{\theta}(y \mid \mathbf{x})$ and $p_{\theta}(\mathbf{x})$ at the same time, which improves both classification accuracy and the quality of generated samples.

This paper is also used as a baseline in OOD detection experiments.

Joint Energy-based Model（JEM）

First, an overview of the model structure:

In a neural network classifier, the values fed into the softmax function are denoted $f_{\theta}(x)$. A traditional classifier feeds $f_{\theta}(x)$ into the softmax to estimate $p(y \mid \mathbf{x})$. This paper also uses $f_{\theta}(x)$ to estimate $p(\mathbf{x}, y)$ and $p(\mathbf{x})$.

Method

EBM

Energy-based model：
$p_{\theta}(\mathrm{x})=\frac{\exp \left(-E_{\theta}(\mathrm{x})\right)}{Z(\theta)} \tag{1}$
where $E_{\theta}(\mathrm{x}): \mathbb{R}^{D} \rightarrow \mathbb{R}$ is the energy function and $Z(\theta)=\int_{\mathbf{x}} \exp \left(-E_{\theta}(\mathbf{x})\right)$ is the partition function (no need to worry about it here). To train the model, maximize the log-likelihood by following its gradient with respect to $\theta$ (this is one of the paper's two losses):
$\frac{\partial \log p_{\theta}(\mathrm{x})}{\partial \theta}=\mathbb{E}_{p_{\theta}\left(\mathrm{x}^{\prime}\right)}\left[\frac{\partial E_{\theta}\left(\mathrm{x}^{\prime}\right)}{\partial \theta}\right]-\frac{\partial E_{\theta}(\mathrm{x})}{\partial \theta} \tag{2}$
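For readers who want the missing step, Eq. (2) follows from Eq. (1) in a few lines. This is the standard derivation, written out here as a reminder and assuming the derivative and the integral can be exchanged:
$\begin{aligned} \log p_{\theta}(\mathrm{x}) &= -E_{\theta}(\mathrm{x}) - \log Z(\theta) \\ \frac{\partial \log Z(\theta)}{\partial \theta} &= \frac{1}{Z(\theta)} \int_{\mathrm{x}^{\prime}} \frac{\partial}{\partial \theta} \exp \left(-E_{\theta}(\mathrm{x}^{\prime})\right) d\mathrm{x}^{\prime} = -\int_{\mathrm{x}^{\prime}} \frac{\exp \left(-E_{\theta}(\mathrm{x}^{\prime})\right)}{Z(\theta)} \frac{\partial E_{\theta}(\mathrm{x}^{\prime})}{\partial \theta} d\mathrm{x}^{\prime} = -\mathbb{E}_{p_{\theta}(\mathrm{x}^{\prime})}\left[\frac{\partial E_{\theta}(\mathrm{x}^{\prime})}{\partial \theta}\right] \\ \frac{\partial \log p_{\theta}(\mathrm{x})}{\partial \theta} &= -\frac{\partial E_{\theta}(\mathrm{x})}{\partial \theta} - \frac{\partial \log Z(\theta)}{\partial \theta} = \mathbb{E}_{p_{\theta}(\mathrm{x}^{\prime})}\left[\frac{\partial E_{\theta}(\mathrm{x}^{\prime})}{\partial \theta}\right] - \frac{\partial E_{\theta}(\mathrm{x})}{\partial \theta} \end{aligned}$
The partition function therefore never has to be evaluated; only samples $\mathrm{x}^{\prime}$ from the model are needed.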
The hard part is drawing samples from $p_{\theta}(\mathrm{x})$. Early EBM training relied on MCMC; this paper uses Stochastic Gradient Langevin Dynamics (SGLD):
$\mathbf{x}_{0} \sim p_{0}(\mathbf{x}), \quad \mathbf{x}_{i+1}=\mathbf{x}_{i}-\frac{\alpha}{2} \frac{\partial E_{\theta}\left(\mathbf{x}_{i}\right)}{\partial \mathbf{x}_{i}}+\epsilon, \quad \epsilon \sim \mathcal{N}(0, \alpha)\tag{3}$
The update resembles PGD. Intuitively, it moves the sample $x$ toward low-energy regions, and each training iteration runs $N$ sampling steps. Recent work shows that samples obtained with SGLD approximate the expectation in Eq. (2) well.
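The update in Eq. (3) can be tried on a toy problem. The sketch below is not the paper's code: it assumes a hand-written energy $E(\mathbf{x}) = \|\mathbf{x}\|^2 / 2$ (so the target density is a standard normal), a uniform initial distribution $p_0$, and hypothetical names `sgld_sample` and `toy_grad_energy`.

```python
import numpy as np

def sgld_sample(grad_energy, x0, n_steps=100, alpha=0.01, rng=None):
    """Run n_steps of the Eq. (3) update, starting from x0."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        # eps ~ N(0, alpha): variance alpha, so the standard deviation is sqrt(alpha)
        noise = rng.normal(0.0, np.sqrt(alpha), size=x.shape)
        x = x - 0.5 * alpha * grad_energy(x) + noise
    return x

def toy_grad_energy(x):
    """Gradient of the assumed toy energy E(x) = ||x||^2 / 2, i.e. dE/dx = x."""
    return x

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(2000, 2))  # assumed p_0: uniform on [-1, 1]^2
samples = sgld_sample(toy_grad_energy, x0, n_steps=500, alpha=0.05, rng=rng)
print(samples.mean(), samples.var())  # should be roughly 0 and roughly 1
```

In a real EBM, `grad_energy` would come from automatic differentiation of the network's energy with respect to its input rather than from a closed form.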

Proposed JEM

Consider a $K$-class classification problem. The network $f_{\theta}: \mathbb{R}^{D} \rightarrow \mathbb{R}^{K}$ maps each data point $x \in \mathbb{R}^{D}$ to $K$ real values called logits. Passed through the softmax transfer function, these logits parameterize a categorical distribution over the classes:
$p_{\theta}(y \mid \mathbf{x})=\frac{\exp \left(f_{\theta}(\mathbf{x})[y]\right)}{\sum_{y^{\prime}} \exp \left(f_{\theta}(\mathbf{x})\left[y^{\prime}\right]\right)} \tag{4}$
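where $f_{\theta}(x)[y]$ is the $y$-th component of the network output vector, i.e., the logit for class $y$. With the same logits, and without changing the model, one can define an energy-based model of the joint distribution of $\mathbf{x}$ and $y$, with energy $E_{\theta}(\mathbf{x}, y)=-f_{\theta}(\mathbf{x})[y]$: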
$p_{\theta}(\mathbf{x}, y)=\frac{\exp \left(f_{\theta}(\mathbf{x})[y]\right)}{Z(\theta)} \tag{5}$
Marginalizing out $y$ (summing over it) gives an unnormalized density model for $\mathbf{x}$:
$p_{\theta}(\mathbf{x})=\sum_{y} p_{\theta}(\mathbf{x}, y)=\frac{\sum_{y} \exp \left(f_{\theta}(\mathbf{x})[y]\right)}{Z(\theta)}\tag{6}$
The energy of a data point $\mathbf{x}$ is therefore:
$E_{\theta}(\mathbf{x})=-\operatorname{LogSumExp}_{y}\left(f_{\theta}(\mathbf{x})[y]\right)=-\log \sum_{y} \exp \left(f_{\theta}(\mathbf{x})[y]\right)\tag{7}$
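To make Eqs. (4) and (7) concrete, here is a small NumPy sketch that computes both quantities from the same logits. The helper names (`log_sum_exp`, `class_probs`, `energy`) and the example logits are illustrative assumptions, not taken from the paper's code. It also shows a consequence of the two formulas: adding a constant to all logits leaves $p_{\theta}(y \mid \mathbf{x})$ unchanged but shifts $E_{\theta}(\mathbf{x})$.

```python
import numpy as np

def log_sum_exp(logits):
    """Numerically stable log(sum(exp(logits))) over the last axis."""
    m = np.max(logits, axis=-1, keepdims=True)
    return np.squeeze(m, axis=-1) + np.log(np.sum(np.exp(logits - m), axis=-1))

def class_probs(logits):
    """Eq. (4): softmax over the class axis."""
    return np.exp(logits - log_sum_exp(logits)[..., None])

def energy(logits):
    """Eq. (7): E(x) = -LogSumExp_y f(x)[y]."""
    return -log_sum_exp(logits)

logits = np.array([[2.0, 0.5, -1.0]])  # made-up logits for one input, K = 3
shifted = logits + 3.0                 # same logits plus a constant

print(class_probs(logits), class_probs(shifted))  # identical class distributions
print(energy(logits), energy(shifted))            # energies differ by exactly 3
```

This extra degree of freedom, invisible to the softmax, is what the energy-based reading uses to define a density over $\mathbf{x}$.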
With these definitions in place, the model can be trained. The objective is to maximize the likelihood $p_{\theta}(\mathbf{x}, y)$, which factorizes as:
$\log p_{\theta}(\mathbf{x}, y)=\log p_{\theta}(\mathbf{x})+\log p_{\theta}(y \mid \mathbf{x}) \tag{8}$
The objective is optimized through the two terms on the right-hand side: $\log p_{\theta}(y \mid \mathbf{x})$ with standard cross-entropy, and $\log p_{\theta}(\mathbf{x})$ with the gradient in Eq. (2), where the expectation is estimated from SGLD samples drawn with Eq. (3).

That is the whole method. The equations that matter are (2), (3), and (8).
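One way to read Eq. (8) as a training loss is sketched below. This is a hedged illustration, not the authors' implementation: the function name `jem_loss` and its arguments are hypothetical, and NumPy only evaluates the loss value (in practice an autodiff framework would differentiate it). The generative term uses the surrogate $E_{\theta}(\mathbf{x}) - E_{\theta}(\mathbf{x}^{\prime})$, whose gradient with respect to $\theta$ equals the negative of Eq. (2) when the sampled $\mathbf{x}^{\prime}$ is held fixed.

```python
import numpy as np

def log_sum_exp(logits):
    """Numerically stable log(sum(exp(logits))) over the last axis."""
    m = np.max(logits, axis=-1, keepdims=True)
    return np.squeeze(m, axis=-1) + np.log(np.sum(np.exp(logits - m), axis=-1))

def jem_loss(logits_data, labels, logits_sampled):
    """Value of -log p(y|x) plus a sample-based surrogate for -log p(x).

    logits_data:    f(x) on a batch of real inputs, shape (B, K)
    labels:         integer class labels, shape (B,)
    logits_sampled: f(x') on inputs x' drawn by SGLD, shape (M, K)
    """
    lse_data = log_sum_exp(logits_data)
    # Cross-entropy: -log softmax(f(x))[y] = LogSumExp(f(x)) - f(x)[y]
    cross_entropy = np.mean(lse_data - logits_data[np.arange(len(labels)), labels])
    # Surrogate: mean E(x) on data minus mean E(x') on samples, with E = -LogSumExp
    neg_log_px = np.mean(-lse_data) - np.mean(-log_sum_exp(logits_sampled))
    return cross_entropy + neg_log_px

# Made-up logits: with all-zero logits the surrogate term cancels and the
# cross-entropy is log(K) for K = 3 classes.
loss = jem_loss(np.zeros((4, 3)), np.array([0, 1, 2, 0]), np.zeros((5, 3)))
print(loss)
```

The two terms are simply summed here; any relative weighting between them would be a further design choice.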

Applications

Besides classification, the hybrid model has several other uses. Three main ones:

Still a generative model

It can generate samples:

After a long discussion with Jiaming and a look at the paper's code, my guess is that the generated images come from Eq. (3), i.e., they are the SGLD samples themselves.

OOD detection

With an energy function available, using it for anomaly detection is natural. Instead of detecting with $E(\mathbf{x})$ directly, the paper proposes the following score:
$s_{\theta}(\mathbf{x})=-\left\|\frac{\partial \log p_{\theta}(\mathbf{x})}{\partial \mathbf{x}}\right\|_{2}$
Results:
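The score is easy to compute once the input gradient is available. As a minimal sketch, assume a hypothetical linear classifier $f(\mathbf{x}) = W\mathbf{x} + b$ (the paper uses a deep network; `ood_score_linear`, `W`, and `b` are illustrative names). Since $Z(\theta)$ does not depend on $\mathbf{x}$, $\partial \log p_{\theta}(\mathbf{x}) / \partial \mathbf{x} = -\partial E_{\theta}(\mathbf{x}) / \partial \mathbf{x}$, which for linear logits is $W^{\top} \operatorname{softmax}(f(\mathbf{x}))$.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax for a 1-D logit vector."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def ood_score_linear(x, W, b):
    """s(x) = -||d log p(x) / dx||_2 for assumed linear logits f(x) = W x + b."""
    # d/dx LogSumExp(W x + b) = W^T softmax(W x + b); log Z drops out because
    # it does not depend on x.
    grad = W.T @ softmax(W @ x + b)
    return -np.linalg.norm(grad)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))  # made-up weights: K = 3 classes, D = 4 inputs
b = rng.normal(size=3)
x = rng.normal(size=4)
print(ood_score_linear(x, W, b))  # always <= 0
```

The score is never positive, and values closer to zero mean the log-density is flatter around $\mathbf{x}$. For a deep network the same quantity would be obtained by backpropagating $\operatorname{LogSumExp}_{y} f_{\theta}(\mathbf{x})[y]$ to the input.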

Robustness

The SGLD process in Eq. (3) closely resembles PGD: many unrealistic samples are drawn and take part in training, so an improvement in robustness is to be expected.
