[Contrastive Learning] Understanding the Behaviour of Contrastive Loss (CVPR '21)
2022-07-25 12:00:00 【chad_lee】
Understanding the Behaviour of Contrastive Loss (CVPR’21)
The temperature coefficient $\tau$ in the contrastive loss is a key hyperparameter, and most papers set $\tau$ to a small value. This article starts from an analysis of the temperature parameter $\tau$ and shows that:
- The contrastive loss automatically mines hard negative samples, which is why it can learn high-quality self-supervised representations. In particular, negatives that are already far away do not need to be pushed further; the loss mainly acts on negatives that are not yet far away (hard negatives), which makes the representation space more uniform.
- The temperature coefficient $\tau$ controls the strength of this hard-negative mining: the smaller $\tau$ is, the more the loss focuses on hard negatives.
Hardness-Awareness
The most widely used contrastive loss is InfoNCE:
$$\mathcal{L}(x_i) = -\log\left[\frac{\exp(s_{i,i}/\tau)}{\sum_{k\neq i}\exp(s_{i,k}/\tau) + \exp(s_{i,i}/\tau)}\right]$$
This loss drives the similarity $s_{i,i}$ between the $i$-th sample and its augmented (positive) view to be as large as possible, and the similarities $s_{i,k}$ with the other samples (negatives) to be as small as possible. But many loss functions satisfy these two requirements, for example the simplest one, $\mathcal{L}_{\text{simple}}$:
$$\mathcal{L}_{\text{simple}}(x_i) = -s_{i,i} + \lambda \sum_{j\neq i} s_{i,j}$$
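For concreteness, here is a minimal PyTorch sketch of both losses (my own illustration, not the paper's code; the batch layout, the use of cosine similarity, and the `lam` value are assumptions):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_anchor, z_pos, z_neg, tau=0.1):
    # s_{i,i}: anchor-positive similarity, shape (B,)
    s_pos = F.cosine_similarity(z_anchor, z_pos, dim=-1)
    # s_{i,k}: anchor-negative similarities, shape (B, K)
    s_neg = F.cosine_similarity(z_anchor.unsqueeze(1), z_neg, dim=-1)
    logits = torch.cat([s_pos.unsqueeze(1), s_neg], dim=1) / tau
    # -log of the softmax probability assigned to the positive pair (index 0)
    target = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, target)

def simple_loss(z_anchor, z_pos, z_neg, lam=1.0):
    # -s_{i,i} + lambda * sum_j s_{i,j}: every negative weighted equally
    s_pos = F.cosine_similarity(z_anchor, z_pos, dim=-1)
    s_neg = F.cosine_similarity(z_anchor.unsqueeze(1), z_neg, dim=-1)
    return (-s_pos + lam * s_neg.sum(dim=1)).mean()

# toy usage on random embeddings
B, K, d = 8, 32, 128
z_a, z_p, z_n = torch.randn(B, d), torch.randn(B, d), torch.randn(B, K, d)
print(info_nce_loss(z_a, z_p, z_n).item(), simple_loss(z_a, z_p, z_n, lam=0.1).item())
```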
But in practice the two losses give very different results (linear-evaluation accuracy, %):
| Dataset | Contrastive Loss | Simple Loss |
|---|---|---|
| CIFAR-10 | 79.75 | 74 |
| CIFAR-100 | 51.82 | 49 |
| ImageNet-100 | 71.53 | 74.31 |
| SVHN | 92.55 | 94.99 |
This is because the simple loss applies the same penalty weight to every negative similarity: $\frac{\partial \mathcal{L}_{\text{simple}}}{\partial s_{i,k}} = \lambda$, i.e., the gradient of the loss with respect to every negative similarity is identical. The contrastive loss, by contrast, automatically penalizes negatives with higher similarity more strongly:
$$\text{gradient w.r.t. the positive pair:}\quad \frac{\partial \mathcal{L}(x_i)}{\partial s_{i,i}} = -\frac{1}{\tau}\sum_{k\neq i} P_{i,k}$$

$$\text{gradient w.r.t. a negative pair:}\quad \frac{\partial \mathcal{L}(x_i)}{\partial s_{i,j}} = \frac{1}{\tau} P_{i,j} \propto \exp(s_{i,j}/\tau)$$
where $P_{i,j}=\frac{\exp(s_{i,j}/\tau)}{\sum_{k\neq i}\exp(s_{i,k}/\tau)+\exp(s_{i,i}/\tau)}$. The denominator of $P_{i,j}$ is the same for every negative, so the larger $s_{i,j}$ is, the larger that negative's gradient, and the more strongly it is pushed away from the anchor (much like focal loss: the harder the sample, the larger the gradient). This drives all samples toward a uniform distribution on the hypersphere.
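This hardness-awareness is easy to check numerically. A small sketch (my own, with made-up similarity values): treat the similarities of a single anchor as leaf variables and differentiate InfoNCE with respect to them.

```python
import torch

tau = 0.1
s_pos = torch.tensor(0.9, requires_grad=True)                # s_{i,i}
s_neg = torch.tensor([0.8, 0.3, -0.2], requires_grad=True)  # s_{i,k}

logits = torch.cat([s_pos.view(1), s_neg]) / tau
loss = -torch.log_softmax(logits, dim=0)[0]  # InfoNCE for one anchor
loss.backward()

print(s_neg.grad)  # the hardest negative (0.8) dominates: grad = P_{i,j} / tau
print(s_pos.grad)  # equals -(1/tau) * sum_k P_{i,k}
```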
To verify that the contrastive loss really owes its quality to mining hard negatives, the paper adds explicit hard-negative selection to the simple loss (picking the 4096 hardest negatives for each sample), which improves its performance:
| Dataset | Contrastive Loss | Simple Loss + Hard |
|---|---|---|
| CIFAR-10 | 79.75 | 84.84 |
| CIFAR-100 | 51.82 | 55.71 |
| ImageNet-100 | 71.53 | 74.31 |
| SVHN | 92.55 | 94.99 |
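A sketch of what this ablation might look like in code (my own reconstruction; the paper's exact selection procedure may differ): restrict the simple loss to the top-k most similar, i.e. hardest, negatives per anchor.

```python
import torch
import torch.nn.functional as F

def simple_loss_hard(z_anchor, z_pos, z_neg_pool, k=4096, lam=1.0):
    s_pos = F.cosine_similarity(z_anchor, z_pos, dim=-1)                    # (B,)
    s_neg = F.cosine_similarity(z_anchor.unsqueeze(1), z_neg_pool, dim=-1)  # (B, N)
    # keep only the k hardest (most similar) negatives for each anchor
    hard, _ = s_neg.topk(min(k, s_neg.size(1)), dim=1)
    return (-s_pos + lam * hard.sum(dim=1)).mean()
```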
The temperature coefficient $\tau$ controls the focus on hard negatives
The smaller the temperature coefficient $\tau$, the more the loss focuses on hard negatives. Specifically:
As $\tau \to 0^+$, the contrastive loss degenerates into a loss that only attends to the hardest (most similar) negative:
$$\lim_{\tau \to 0^+} \mathcal{L}(x_i) = \frac{1}{\tau}\max\left[s_{\max} - s_{i,i},\, 0\right]$$

where $s_{\max} = \max_{k\neq i} s_{i,k}$ is the largest negative similarity.
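One way to see this (a standard log-sum-exp argument, filled in here for completeness): rewrite the loss and let the largest term dominate the denominator,

$$\mathcal{L}(x_i) = \frac{1}{\tau}\left[\tau \log\!\left(\sum_{k\neq i} e^{s_{i,k}/\tau} + e^{s_{i,i}/\tau}\right) - s_{i,i}\right] \;\xrightarrow{\;\tau\to 0^+\;}\; \frac{1}{\tau}\left[\max(s_{\max},\, s_{i,i}) - s_{i,i}\right] = \frac{1}{\tau}\max\left[s_{\max} - s_{i,i},\, 0\right].$$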
In effect, the loss pushes the negatives away one at a time, always acting on whichever negative is currently closest, until every negative sits at the same distance from the anchor.

As $\tau \to \infty$, the contrastive loss essentially degenerates into the simple loss, weighting all negatives equally.
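Both limits can be read off the weights $P_{i,j}$ directly; a short numerical illustration (made-up similarities):

```python
import torch

s_pos = torch.tensor([0.9])
s_neg = torch.tensor([0.8, 0.5, 0.2, -0.3])
for tau in (0.07, 0.5, 5.0):
    logits = torch.cat([s_pos, s_neg]) / tau
    P = torch.softmax(logits, dim=0)[1:]  # penalty weights on the negatives
    print(tau, (P / P.sum()).tolist())
# tau -> 0: almost all weight on the hardest negative (0.8);
# tau -> inf: weights become uniform, recovering the simple loss.
```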
So the smaller the temperature coefficient $\tau$ is, the more uniformly the features are distributed. But this is not purely a good thing, because potential positives (false negatives, i.e., semantically similar samples that happen to land in the negative set) are pushed away as well.
