Is binary cross entropy really suitable for multi-label classification?
2022-07-25 09:59:00 【Tobi_Obito】
While working on a multi-label text classification project, I was once confused about the choice of loss and the handling of the final layer of the forward pass for multi-label classification. After researching at the time, I settled on a scheme:
1. Use a final layer with as many output nodes as there are categories, activated with sigmoid. This effectively treats each category as its own binary sub-task, with each position of the output corresponding to one category. That is also why sigmoid is used rather than softmax: softmax does not treat the nodes as independent; on the contrary, it makes them influence one another.
2. Train with Binary Cross Entropy (BCE) loss. If each node of the final layer is a 1/0 classifier for one category, then BCE loss does look like the natural choice; see the sketch after this list. But is it really a good one? My answer: not necessarily, and the finer-grained the labels, the worse it gets.
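To make the setup concrete, here is a minimal PyTorch sketch of this scheme (the layer sizes, variable names, and batch are illustrative assumptions, not the original project's code):

import torch
import torch.nn as nn

NUM_LABELS = 27   # assumed label count, matching the later example
HIDDEN_DIM = 768  # assumed encoder output size

# One independent binary sub-task per label: a linear head with
# NUM_LABELS outputs, trained with BCE on sigmoid probabilities.
head = nn.Linear(HIDDEN_DIM, NUM_LABELS)
criterion = nn.BCEWithLogitsLoss()  # fuses sigmoid + BCE, numerically stabler

features = torch.randn(4, HIDDEN_DIM)                   # a batch of encoder outputs
targets = torch.randint(0, 2, (4, NUM_LABELS)).float()  # multi-hot labels

logits = head(features)
loss = criterion(logits, targets)  # sigmoid(logits) compared against multi-hot targets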
The key issue
Let's compare, by example, how BCE behaves against the CrossEntropy (CE) loss used for multi-class tasks. (To keep this intuitive I won't paste the formulas here; they are everywhere online and do not help with the key point. Worked numeric examples highlight it directly.)
Multi-class task: CE
label: [0, 1, 0, 0]
pred:  [0.3, 0.67, 0.4, 0.25]
CE loss = -(0 × log(0.3) + 1 × log(0.67) + 0 × log(0.4) + 0 × log(0.25)) = -log(0.67)
Key distinguishing feature: with CE as the loss, the only thing that drives model learning is the pred value at the label=1 position. The pred values at the label=0 positions, whether large or small, have no effect on the loss. Keep that in mind, then look at BCE.
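A quick numeric check of this claim, using the example values above (a minimal sketch; note that a real softmax output would sum to 1, so these pred values are purely illustrative):

import math

label = [0, 1, 0, 0]
pred = [0.3, 0.67, 0.4, 0.25]

# Only the label=1 position contributes; the other pred values are ignored.
ce = -sum(y * math.log(p) for y, p in zip(label, pred))
print(ce)  # 0.4005 = -log(0.67); changing pred[0], pred[2], pred[3] leaves it unchanged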
Multi-label task: BCE
label: [0, 1, 0, 1]
pred:  [0.3, 0.67, 0.4, 0.25]
BCE loss = -(0 × log(0.3) + (1-0) × log(0.7)
           + 1 × log(0.67) + (1-1) × log(0.33)
           + 0 × log(0.4) + (1-0) × log(0.6)
           + 1 × log(0.25) + (1-1) × log(0.75))
Key distinguishing feature: with BCE as the loss, both the label=0 positions and the label=1 positions affect model learning. In training this produces an effect: a 0 at some position of the label guides the model to output a value close to 0 there.
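And the corresponding check for BCE (same sketch style, same illustrative numbers):

import math

label = [0, 1, 0, 1]
pred = [0.3, 0.67, 0.4, 0.25]

# Every position contributes: label=1 terms reward a high pred,
# label=0 terms punish it through the log(1 - p) part.
bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
           for y, p in zip(label, pred))
print(bce)  # ≈ 2.65 (summed, not averaged); changing any pred value changes the loss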
That sounds like a good thing, and it is, but only for the binary classification of the single category at that position; that is why plain binary classification with BCE is perfectly fine. For multi-label classification, however, we must account for the global effect this loss has across all labels. Consider an example:
label:[0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
Only 2 positions are 1. That is, with a large total number of labels, this sample matches only 2 of them. As a result, model learning is dominated by the large number of 0s rather than the 1s, while the feature information contained in the text relates only to the 1s (whether a piece of text belongs to a category is determined by the features it has, not by the absence of other categories' features). So the most informative signal, the salient features, ends up contributing little to what the model learns about classification. If that is still unclear, think of it this way: imagine teaching a child to recognize objects, but every time only telling it what each object is not: not a, not b, not c... It can then only identify items by elimination, and obviously the results will be much worse. The sketch below makes this imbalance concrete.
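Here is a rough sketch that splits the BCE loss into its positive and negative parts, assuming (purely for illustration) that the model currently outputs 0.5 for every label:

import math

num_labels = 27
num_positive = 2
p = 0.5  # assumed uniform prediction for every label

pos_loss = num_positive * -math.log(p)                     # from the 2 labels that are 1
neg_loss = (num_labels - num_positive) * -math.log(1 - p)  # from the 25 labels that are 0
print(pos_loss, neg_loss)  # ≈ 1.39 vs. ≈ 17.33: the 0 positions dominate the loss and its gradient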
Summary
The crux of the problem is that irrelevant information dominates model learning at the expense of predicting the 1s. If, in a given task, most samples carry a large number of true labels, this is not a problem. But in most real multi-label text classification scenarios, the number of true labels on a sample is negligible compared with the total number of labels, and then this issue seriously hurts training. So BCE is not a "safe to use" method for multi-label classification.