Is binary cross entropy really suitable for multi-label classification?
2022-07-25 09:59:00 【Tobi_Obito】
While working on a multi-label text classification project, I was once confused about the choice of loss and the handling of the final layer of the forward pass for multi-label classification. After researching at the time, I settled on a scheme:
1. Use a final layer with as many output nodes as there are categories, activated with sigmoid. This effectively treats each category as its own binary sub-task, with each position of the output corresponding to one category. That is also why sigmoid is used rather than softmax: softmax does not treat the nodes as independent; on the contrary, it makes them influence one another.
2. Train with Binary Cross Entropy (BCE) loss. If each node of the final layer is a 1/0 classifier for one category, then BCE loss does look like the natural choice; see the sketch after this list. But is it really a good one? My answer: not necessarily, and the finer-grained the labels, the worse it gets.
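To make the setup concrete, here is a minimal PyTorch sketch of this scheme (the layer sizes, variable names, and batch are illustrative assumptions, not the original project's code):

import torch
import torch.nn as nn

NUM_LABELS = 27   # assumed label count, matching the later example
HIDDEN_DIM = 768  # assumed encoder output size

# One independent binary sub-task per label: a linear head with
# NUM_LABELS outputs, trained with BCE on sigmoid probabilities.
head = nn.Linear(HIDDEN_DIM, NUM_LABELS)
criterion = nn.BCEWithLogitsLoss()  # fuses sigmoid + BCE, numerically stabler

features = torch.randn(4, HIDDEN_DIM)                   # a batch of encoder outputs
targets = torch.randint(0, 2, (4, NUM_LABELS)).float()  # multi-hot labels

logits = head(features)
loss = criterion(logits, targets)  # sigmoid(logits) compared against multi-hot targets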
The key issue
Let's compare, by example, how BCE behaves against the CrossEntropy (CE) loss used for multi-class tasks. (To keep this intuitive I won't paste the formulas here; they are everywhere online and do not help with the key point. Worked numeric examples highlight it directly.)
Multi-class task: CE
label: [0, 1, 0, 0]
pred:  [0.3, 0.67, 0.4, 0.25]
CE loss = -(0 × log(0.3) + 1 × log(0.67) + 0 × log(0.4) + 0 × log(0.25)) = -log(0.67)
Key distinguishing feature: with CE as the loss, the only thing that drives model learning is the pred value at the label=1 position. The pred values at the label=0 positions, whether large or small, have no effect on the loss. Keep that in mind, then look at BCE.
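A quick numeric check of this claim, using the example values above (a minimal sketch; note that a real softmax output would sum to 1, so these pred values are purely illustrative):

import math

label = [0, 1, 0, 0]
pred = [0.3, 0.67, 0.4, 0.25]

# Only the label=1 position contributes; the other pred values are ignored.
ce = -sum(y * math.log(p) for y, p in zip(label, pred))
print(ce)  # 0.4005 = -log(0.67); changing pred[0], pred[2], pred[3] leaves it unchanged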
Multi-label task: BCE
label: [0, 1, 0, 1]
pred:  [0.3, 0.67, 0.4, 0.25]
BCE loss = -(0 × log(0.3) + (1-0) × log(0.7)
           + 1 × log(0.67) + (1-1) × log(0.33)
           + 0 × log(0.4) + (1-0) × log(0.6)
           + 1 × log(0.25) + (1-1) × log(0.75))
Key distinguishing feature: with BCE as the loss, both the label=0 positions and the label=1 positions affect model learning. In training this produces an effect: a 0 at some position of the label guides the model to output a value close to 0 there.
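And the corresponding check for BCE (same sketch style, same illustrative numbers):

import math

label = [0, 1, 0, 1]
pred = [0.3, 0.67, 0.4, 0.25]

# Every position contributes: label=1 terms reward a high pred,
# label=0 terms punish it through the log(1 - p) part.
bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
           for y, p in zip(label, pred))
print(bce)  # ≈ 2.65 (summed, not averaged); changing any pred value changes the loss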
That sounds like a good thing, and it is, but only for the binary classification of the single category at that position; that is why plain binary classification with BCE is perfectly fine. For multi-label classification, however, we must account for the global effect this loss has across all labels. Consider an example:
label:[0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
Only 2 positions are 1. That is, with a large total number of labels, this sample matches only 2 of them. As a result, model learning is dominated by the large number of 0s rather than the 1s, while the feature information contained in the text relates only to the 1s (whether a piece of text belongs to a category is determined by the features it has, not by the absence of other categories' features). So the most informative signal, the salient features, ends up contributing little to what the model learns about classification. If that is still unclear, think of it this way: imagine teaching a child to recognize objects, but every time only telling it what each object is not: not a, not b, not c... It can then only identify items by elimination, and obviously the results will be much worse. The sketch below makes this imbalance concrete.
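Here is a rough sketch that splits the BCE loss into its positive and negative parts, assuming (purely for illustration) that the model currently outputs 0.5 for every label:

import math

num_labels = 27
num_positive = 2
p = 0.5  # assumed uniform prediction for every label

pos_loss = num_positive * -math.log(p)                     # from the 2 labels that are 1
neg_loss = (num_labels - num_positive) * -math.log(1 - p)  # from the 25 labels that are 0
print(pos_loss, neg_loss)  # ≈ 1.39 vs. ≈ 17.33: the 0 positions dominate the loss and its gradient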
Summary
The crux of the problem is that irrelevant information dominates model learning at the expense of predicting the 1s. If, in a given task, most samples carry a large number of true labels, this is not a problem. But in most real multi-label text classification scenarios, the number of true labels on a sample is negligible compared with the total number of labels, and then this issue seriously hurts training. So BCE is not a "safe to use" method for multi-label classification.