
[Out-of-Distribution Detection] Deep Anomaly Detection with Outlier Exposure, ICLR '19

2022-06-22 06:55:00 chad_ lee

The paper trains anomaly detectors with an auxiliary dataset of outliers, a method called Outlier Exposure (OE). This enables the anomaly detector to generalize and detect unseen anomalies. In extensive experiments on natural language processing and on small- and large-scale vision tasks, the paper finds that Outlier Exposure significantly improves detection performance.

Outlier Exposure

So-called outlier exposure means showing anomalous data to the anomaly detector, letting the model learn from existing anomalies so that it can generalize to anomalies it has never seen.

The paper has only one formula, the optimization objective after the model incorporates OE:
$$\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{in}}}\left[\mathcal{L}(f(x), y) + \lambda\, \mathbb{E}_{x' \sim \mathcal{D}_{\text{out}}^{\text{OE}}}\left[\mathcal{L}_{\text{OE}}\left(f(x'), f(x), y\right)\right]\right]$$
The first term $\mathcal{L}$ is the optimization objective of the original model on the original task; the second term $\mathcal{L}_{\text{OE}}$ is the OE objective, which depends on the task and is defined separately for each experiment below.
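To make the structure of this objective concrete, here is a minimal PyTorch-style sketch. The function and argument names are illustrative, not from the paper; `task_loss` and `oe_loss` stand in for the task-specific $\mathcal{L}$ and $\mathcal{L}_{\text{OE}}$, and `lam` is the hyperparameter $\lambda$:

```python
def oe_training_step(model, x_in, y_in, x_out, task_loss, oe_loss, lam=0.5):
    """One evaluation of the OE objective (illustrative sketch).

    task_loss: the original objective L on the in-distribution batch.
    oe_loss:   the task-specific OE term L_OE evaluated on outliers x_out.
    lam:       the weighting hyperparameter lambda.
    """
    out_in = model(x_in)    # predictions on in-distribution batch
    out_oe = model(x_out)   # predictions on outlier (OE) batch
    return task_loss(out_in, y_in) + lam * oe_loss(out_oe)
```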

Data sets

IN-DISTRIBUTION DATASETS: SVHN, CIFAR, Tiny ImageNet, Places365, 20 Newsgroups, TREC, SST. When one of them serves as the ID dataset, similar datasets are used as the OOD test sets.

OUTLIER EXPOSURE DATASETS: 80 Million Tiny Images, ImageNet-22K, WikiText-2. Overlap with the ID datasets is removed, guaranteeing that $\mathcal{D}_{\text{out}}^{\text{OE}}$ and $\mathcal{D}_{\text{out}}^{\text{test}}$ are disjoint.

Task 1: Multiclass Classification

For a $k$-class classification task, the input is $x \in \mathcal{X}$ and the label is $y \in \mathcal{Y} = \{1, 2, \ldots, k\}$. The classifier is $f: \mathcal{X} \rightarrow \mathbb{R}^{k}$, and for any $x$, $\mathbf{1}^{\top} f(x) = 1$ and $f(x) \succeq 0$, i.e., $f(x)$ is a probability vector over the $k$ classes.

Maximum Softmax Probability (MSP)

The baseline. Given an input $x$, the OOD score is the maximum softmax probability $\max_{c} f_{c}(x)$; lower values indicate a likely OOD input.
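As a minimal sketch, assuming `logits` is the classifier's pre-softmax output for a batch:

```python
import torch.nn.functional as F

def msp_score(logits):
    # Maximum softmax probability per sample; lower values suggest OOD.
    return F.softmax(logits, dim=1).max(dim=1).values
```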

The fine-tuning objective is:
$$\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{in}}}\left[-\log f_{y}(x)\right] + \lambda\, \mathbb{E}_{x \sim \mathcal{D}_{\text{out}}^{\text{OE}}}\left[H(\mathcal{U}; f(x))\right]$$
where $H$ is cross-entropy and $\mathcal{U}$ is the uniform distribution over the $k$ classes.
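A minimal sketch of this objective, assuming `logits_in`/`logits_out` are the classifier outputs on the in-distribution and OE batches. Note that $H(\mathcal{U}; f(x)) = -\frac{1}{k}\sum_{c} \log f_{c}(x)$, i.e., the negated mean of the log-softmax over classes:

```python
import torch.nn.functional as F

def oe_finetune_loss(logits_in, targets, logits_out, lam=0.5):
    # Standard cross-entropy on the in-distribution batch.
    ce = F.cross_entropy(logits_in, targets)
    # Cross-entropy from the uniform distribution to f(x'):
    # H(U; f(x')) = -(1/k) * sum_c log f_c(x'), averaged over the batch.
    h_uniform = -F.log_softmax(logits_out, dim=1).mean()
    return ce + lam * h_uniform
```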

In fact, adding the OE regularization term from the start of training works even better; fine-tuning is chosen to save time and GPU memory.

With OE added, the MSP method improves on both vision and NLP tasks:

[Figures: OOD detection results for MSP with and without OE on vision and NLP benchmarks]

Confidence Branch

This is the method proposed in "Learning Confidence for Out-of-Distribution Detection in Neural Networks" (2018): learn a confidence branch that outputs an OOD score $b: \mathcal{X} \rightarrow [0,1]$ for each sample. Applying OE here means adding the following term to the original model's optimization objective:
$$0.5\, \mathbb{E}_{x \sim \mathcal{D}_{\text{out}}^{\text{OE}}}\left[\log b(x)\right]$$
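A sketch of this extra term, assuming `b_out` holds the branch outputs $b(x')$ on an OE batch; the clamp guarding against $\log 0$ is my addition, not from the paper:

```python
import torch

def confidence_oe_term(b_out):
    # Minimizing 0.5 * E[log b(x')] drives the predicted confidence
    # b(x') toward 0 on outlier-exposure samples.
    return 0.5 * torch.log(b_out.clamp_min(1e-12)).mean()
```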
Results:

[Figure: results for the confidence branch with and without OE]

Synthetic Outliers

The author also wanted to use OE against adversarial examples, so they tried perturbing images with noise and using the noisy images as the OE dataset. However, they found that although the classifier can memorize these noise features, it fails to recognize new OOD samples. The author then directly used the code from "Training Confidence-Calibrated Classifiers for Detecting Out-of-Distribution Samples", which fine-tunes with a GAN on top of MSP, assigning high OOD scores to GAN-generated samples. Fine-tuning this with OE improves performance further. I did not fully understand the implementation details of this part; the section is short and there is no appendix.

[Figure: results comparing GAN-based synthetic outliers with and without OE fine-tuning]

Task 2: Density Estimation

A density estimator learns the probability density function of the data distribution $\mathcal{D}_{\text{in}}$. Anomalous samples should have low probability density, because they rarely appear in $\mathcal{D}_{\text{in}}$.

PixelCNN++

The OOD score of a sample $x$ is its bits per pixel (BPP), expressed as nll(x)/num_pixels, where nll is the negative log-likelihood. Here, OE is implemented with a margin loss over the log-likelihood difference between in-distribution and anomalous examples. So for a sample $x_{\text{in}}$ from $\mathcal{D}_{\text{in}}$ and an outlier $x_{\text{out}}$ from $\mathcal{D}_{\text{out}}^{\text{OE}}$, the loss is:
$$\max\left\{0,\ \text{num\_pixels} + \operatorname{nll}(x_{\text{in}}) - \operatorname{nll}(x_{\text{out}})\right\}$$
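A sketch of the score and the margin loss, assuming `nll_in` and `nll_out` are per-sample negative log-likelihood tensors from the density model (function names are mine):

```python
import torch

def bits_per_pixel(nll, num_pixels):
    # OOD score: negative log-likelihood normalized by image size.
    return nll / num_pixels

def density_oe_loss(nll_in, nll_out, num_pixels):
    # Hinge on the likelihood gap: zero penalty once each outlier is at
    # least num_pixels (in nll units) less likely than the in-dist sample.
    return torch.clamp(num_pixels + nll_in - nll_out, min=0).mean()
```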
[Figure: PixelCNN++ OOD detection results with and without OE]

Language Modeling

A QRNN is used as the baseline OOD detector. The OOD score is bits per character (BPC) or bits per word (BPW), defined as nll(x)/sequence_length, where nll($x$) is the negative log-likelihood of the sequence $x$. OE is implemented by adding the cross-entropy to the uniform distribution on tokens from sequences in $\mathcal{D}_{\text{out}}^{\text{OE}}$ as an additional loss term.
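A sketch of this additional term, assuming `logits_out` are the model's next-token logits with shape (batch, seq_len, vocab_size) on tokens drawn from $\mathcal{D}_{\text{out}}^{\text{OE}}$ sequences:

```python
import torch.nn.functional as F

def lm_oe_term(logits_out):
    # Cross-entropy from the uniform distribution over the vocabulary to
    # the model's next-token distribution, averaged over batch and positions.
    return -F.log_softmax(logits_out, dim=-1).mean()
```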

[Figure: language-model OOD detection results with and without OE]

Summary

The author lists several advantages of OE: extensibility (it applies to many tasks), flexibility in the choice of $\mathcal{D}_{\text{out}}^{\text{OE}}$, and that OE can even improve the model's accuracy on the original task.

But I think the author's claim that "$\mathcal{D}_{\text{out}}^{\text{OE}}$ can inspire the model, so that it generalizes and recognizes an unseen $\mathcal{D}_{\text{out}}^{\text{test}}$" is a bit too mysterious; after all, such distributions cannot be precisely delineated. It feels a bit like "Transfer Unlearning".

In my view, the strength of this paper is its many thorough and precise experiments: it does not tell long stories, and every claim is backed experimentally.
