Semi-supervised learning: an introduction to the Π-Model, Temporal Ensembling, and Mean Teacher
2022-06-27 08:47:00 【umbrellalalalala】
Also published synchronously on my Zhihu account of the same name.
This post learns the ideas of semi-supervision through two papers:
arXiv 1610 (ICLR 2017): Temporal Ensembling for Semi-Supervised Learning
arXiv 1703 (NIPS 2017): Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
It briefly introduces the Π-Model, Temporal Ensembling, and Mean Teacher. Comments and discussion are welcome; if you like it, please follow, and updates will continue.
An ensemble of multiple networks is often stronger than a single network.
Techniques such as dropout, dropconnect, and stochastic depth have indirectly demonstrated this, and in swapout networks training focuses on a specific subset of the network. These techniques let the trained network be regarded as an ensemble of trained sub-networks.
The authors extend this viewpoint by ensembling the outputs of a single network at different epochs (combined with different regularization and input-data augmentation):
We extend this idea by forming ensemble predictions during training, using the outputs of a single network on different training epochs and under different regularization and input augmentation conditions.
Training is still performed on a single network, but because of dropout, the predictions at different epochs correspond to an ensemble prediction over a large number of individual sub-networks.
In short, the predictions of a single network at different epochs are ensembled, and this ensemble prediction can be used for semi-supervision: compared with the output of the network currently being trained, the ensemble prediction should be closer to the unknown labels of the unlabeled data. The ensemble prediction can therefore serve as the label of the unlabeled inputs (i.e., it can be regarded as a pseudo label).
(I do have a question here. The premise is that an ensemble of multiple networks is stronger than a single network. Although different epochs can be viewed as different networks being ensembled, a later epoch is arguably better than the earlier ones, so ensembling the current epoch with all previous epochs should not be stronger, since all previous epochs are weaker than the current one. So I suspect the reason this works may not be that the ensemble is stronger, but that the ensemble target lags behind and smooths the current network's predictions, which can be regarded as a kind of regularization.)
The authors' method relies heavily on dropout regularization and rich input augmentation; without either of them, the pseudo labels inferred for the unlabeled data in this way would not be very trustworthy.
The proposed method is called self-ensembling. The authors further find that even in the fully labeled case the method improves classification accuracy, and it also provides tolerance to incorrect labels.
There are two ways to implement self-ensembling: the Π-model and temporal ensembling.
The authors illustrate the method with a classification problem: there are N input data points, of which M are labeled, and there are C classes in total.
I. A brief introduction to the Π-model
The model's flow chart is shown in the figure above (from the paper); reading the pseudocode is enough to understand it:
x denotes the input, y the label, and z the predicted value. There are two z's: they are the two predictions produced by feeding the input through the network with different data augmentations and different dropout masks. One z is combined with the label y in a cross-entropy loss, and a mean-squared-error loss is computed between the two z's. The two losses are combined by a weighted sum (labeled data uses both loss terms, unlabeled data uses only the second), and the network parameters are optimized with Adam.
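To make this concrete, below is a minimal PyTorch sketch of a Π-model loss as I understand it. It is not the paper's code: `model`, `augment`, and `w_t` are placeholders (the classifier, a stochastic augmentation function, and the ramp-up weight discussed below), and comparing softmaxed outputs in the MSE term is an implementation choice.

```python
import torch
import torch.nn.functional as F

def pi_model_loss(model, augment, x, y, labeled_mask, w_t):
    """One Π-model loss computation (sketch).

    x: a batch mixing labeled and unlabeled inputs
    y: labels; entries for unlabeled samples are ignored
    labeled_mask: bool tensor, True where a label exists
    w_t: ramp-up weight of the unsupervised term at the current epoch
    """
    # Two stochastic forward passes: different augmentations and,
    # with the model in train mode, different dropout masks.
    z1 = model(augment(x))
    z2 = model(augment(x))

    # Supervised term: cross-entropy on the labeled samples only.
    if labeled_mask.any():
        sup = F.cross_entropy(z1[labeled_mask], y[labeled_mask])
    else:
        sup = z1.new_zeros(())

    # Unsupervised term: mean squared error between the two predictions,
    # applied to every sample, labeled or not.
    unsup = F.mse_loss(F.softmax(z1, dim=1), F.softmax(z2, dim=1))

    return sup + w_t * unsup
```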
Regarding the minimization between $z_i$ and $\tilde{z}_i$ (the second part of the loss), the paper gives several justifications:
1. It forces the dark knowledge of the two z's to be as close as possible, which is a much stronger requirement than only asking the final classification to remain unchanged.
2. Because of dropout, the network's output during training is a random variable: the same input fed to the same network produces different outputs (for the same x, the two resulting z's are unequal). Data augmentation has the same effect and also causes a difference between the two z's. This difference can be regarded as an error in classification, since the two z's correspond to the same input x, so minimizing the difference between them is a reasonable goal.
* For unlabeled data, researchers have proposed a consistency-constraint assumption:
For unlabeled data, if some perturbation is applied to the model or to the data, the prediction results should stay consistent.
Source: https://blog.csdn.net/u011345885/article/details/111758193
The dropout mentioned above is a perturbation of the model, and data augmentation is a perturbation of the data.
There are two other ways to view minimizing the difference between the two z's:
1. Consistency regularization: constructing the loss according to the consistency constraint above;
2. Pseudo labeling: treating one of the z's as a pseudo label and making the other z approach it.
(If you look at the unsupervised loss term of the Π-model from perspective 2, the pseudo-label view, then temporal ensembling is an improvement of that pseudo label, which I will discuss later.)
Regarding the weight $w$: its formula is $w(t)=\exp[-5(1-T)^2]$, where $T$ increases linearly from 0 to 1 over the first 80 epochs, so $w$ gradually grows from a small positive number to 1. At the beginning, training therefore depends mainly on the supervised component of the loss, i.e., only on the labeled data. Note that the unsupervised component of the loss must ramp up slowly enough; otherwise the network easily falls into a degenerate solution and fails to learn a meaningful classification.
(In the appendix where the w(t) formula is given, the authors also provide other training details: besides the weight ramp-up over the first 80 epochs, the learning rate and Adam's $\beta_1$ are also annealed; the batch size is 100, and the network is trained for 300 epochs.)
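As a small illustration of the ramp-up just described, here is a sketch of that Gaussian ramp-up function (the function name and the `max_w` argument are mine; the 80-epoch length follows the text above):

```python
import numpy as np

def rampup_weight(epoch, rampup_length=80, max_w=1.0):
    """w(t) = exp(-5 * (1 - T)^2), with T rising linearly from 0 to 1
    over the first `rampup_length` epochs, then staying at 1."""
    T = np.clip(epoch / rampup_length, 0.0, 1.0)
    return max_w * float(np.exp(-5.0 * (1.0 - T) ** 2))
```

At epoch 0 this gives exp(-5) ≈ 0.0067, a small positive number, and it reaches max_w at epoch 80; for temporal ensembling the weight is additionally forced to 0 during the first epoch (see below).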
II. A brief introduction to Temporal Ensembling
If you regard $\tilde z_i$ in the Π-model as a pseudo label for $z_i$, then this pseudo label is not very good; temporal ensembling improves exactly this point. See the figure in the paper:
Note that the subscript $i$ in the figure above does not denote a time step; it indexes the $i$-th of the $N$ data points. The generated $z_i$ participates in producing the pseudo label of the $i$-th data point in the next epoch.
(Note that the pseudo labels are updated once per epoch, not once per batch, so this update is actually very slow. Later work such as Mean Teacher also pointed out that this method becomes hard to handle for large datasets.)
Note in the pseudocode above that $\tilde{z}$ denotes the pseudo labels of all $N$ data points, and each pseudo label $\tilde{z}_i$ is a $C$-dimensional vector. The authors note that the update of $\tilde{z}$ could be completed inside the minibatch loop (one $\tilde{z}_i$ per iteration), but for clarity the pseudocode places the update in the epoch loop.
The difference from the Π-model lies in the two lines after the end for statement. The authors set α to 0.6, and they call the second of these two lines a correction for startup bias, which they say is similar to what is done in Adam:
A similar bias correction has been used in, e.g., Adam (Kingma & Ba, 2014) and mean-only batch normalization (Salimans & Kingma, 2016).
This is because $Z$ is computed with the accumulation formula $Z \leftarrow \alpha Z + (1-\alpha) z$, i.e., as a weighted sum of the historical value $Z$ and the new value $z$ (note the word "accumulation"; it comes up again below). At the very beginning $Z$ is 0, so the computed $Z$ equals $(1-\alpha)z$; at this point $t=1$, and $Z/(1-\alpha^t)$ is exactly $z$ itself, i.e., the value is scaled back up to what it should be. As $t$ grows, $(1-\alpha^t)$ gets closer and closer to 1, so dividing by it amplifies less and less. In other words, the denominator matters mainly at the beginning: it compensates for the accumulated value being too small early on (as just said, the accumulation is a weighted sum of the historical value $Z$ and the new value $z$, and at the beginning the historical value $Z$ is very small, even 0 at $t=1$). That is how dividing by a number smaller than 1 enlarges the value and corrects the startup bias, i.e., the deviation in the initial stage.
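Below is a minimal NumPy sketch of the per-epoch pseudo-label update just described. The array shapes, the random stand-in for the network outputs, and `num_epochs` are placeholders for illustration only:

```python
import numpy as np

N, C = 1000, 10          # number of samples and classes (example values)
alpha = 0.6
num_epochs = 5

Z = np.zeros((N, C))     # accumulated ensemble predictions, starts at 0
rng = np.random.default_rng(0)

for t in range(1, num_epochs + 1):
    # stand-in for this epoch's network outputs on all N samples
    z = rng.random((N, C))

    # accumulate: weighted sum of the history Z and the new predictions z
    Z = alpha * Z + (1.0 - alpha) * z

    # startup-bias correction: at t = 1, Z equals only (1 - alpha) * z,
    # so dividing by (1 - alpha**t) restores the correct scale;
    # as t grows, (1 - alpha**t) -> 1 and the correction fades away
    z_tilde = Z / (1.0 - alpha ** t)
    # z_tilde is then used as the pseudo-label target during the next epoch
```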
Because the pseudo labels are built with a moving average, some parameters need separate settings in the first epoch: w(t) is set to 0 during the first epoch, meaning the loss contains only the supervised component.
Advantages of temporal ensembling over the Π-model:
1. Training is faster, because each input is no longer evaluated twice per epoch to produce two outputs z;
2. The training targets are less noisy than with the Π-model (the authors do not explain this further; presumably the targets are more stable):
Second, the training targets z ~ \tilde z z~ can be expected to be less noisy than with Π-model.
Reference:
https://blog.csdn.net/u011345885/article/details/111758193
III. Mean Teacher
The authors point out the drawback of temporal ensembling: the pseudo labels are updated only once per epoch, so on a large dataset this update becomes very slow, which is a serious problem. To overcome it, the proposed method takes a moving average of the model's weights instead of a moving average of the pseudo-label predictions.
To overcome this problem, we propose Mean Teacher, a method that averages model weights instead of label predictions.
The authors explain that the model has a dual identity: student and teacher. As a student, the model keeps learning; as a teacher, the model generates the targets (pseudo labels). (Note that Mean Teacher updates the teacher model's parameters once per batch; see below.)
Because the model generates its own targets, it may reinforce its own mistakes, so improving the quality of the targets, i.e., of the pseudo labels, is worth considering. The authors see two ways to improve the targets:
- Carefully choose the perturbation applied to the data or the model, rather than merely adding additive or multiplicative noise. (From the paper: a good model should give consistent predictions for two similar data points.)
- Carefully choose the teacher model, instead of directly copying the student model as the teacher model.
The first approach has already been used by the following work:
Miyato, Takeru, Maeda, Shin-ichi, Koyama, Masanori, and Ishii, Shin. Virtual Adversarial Training: a Regularization Method for Supervised and Semi-supervised Learning. arXiv:1704.03976 [cs, stat], April 2017. arXiv: 1704.03976.
link :https://ieeexplore.ieee.org/document/8417973
(TPAMI; 421 citations as of 2022-05-07)
The authors adopt the second approach. Concretely, for every batch, the student model's parameters are updated by backpropagation, and then the teacher model's parameters are updated from them by an EMA (exponential moving average):
Both the student model and the teacher model can classify, but after training the teacher model tends to have better accuracy.
The so-called EMA update uses the student model's parameters to update the teacher model's parameters as $\theta'_t = \alpha\,\theta'_{t-1} + (1-\alpha)\,\theta_t$, where $\theta'$ denotes the teacher's parameters and $\theta$ the student's (note that the very first step is to copy the student model's parameters into the teacher model).
During the ramp-up stage, α is set to 0.99, and to 0.999 for the rest of training (the larger α is, the smaller the influence of the student model's parameters on the teacher model's parameters). This is because at the beginning the student model improves very quickly and the teacher needs to forget the earlier, inaccurate student weights, whereas once the student's improvement slows down, a longer teacher memory is better.
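Putting the last two paragraphs together, here is a minimal PyTorch sketch of the per-batch teacher update. It assumes student and teacher share the same architecture; the function name and the `in_rampup` flag in the usage comment are mine:

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, alpha):
    """EMA update: theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student.
    Called once per batch, right after the student has been updated by backprop."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

# usage inside the training loop (sketch):
#   loss.backward(); optimizer.step()          # update the student
#   alpha = 0.99 if in_rampup else 0.999       # schedule described above
#   update_teacher(teacher, student, alpha)    # then update the teacher
```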
This paragraph draws on the post: Semi-supervised learning: Π-Model, Temporal Ensembling, Mean Teacher
Personally, I think the existence of the teacher model slows down the student model's updates (because of the consistency cost), which acts as a form of regularization. One could also use the earlier temporal-ensembling argument that an ensemble of multiple models is stronger than a single model: since the teacher model's parameters are a moving average of the student model's parameters across stages of training, the teacher can be regarded as an ensemble of the student models at different stages (and since α is very close to 1, the teacher's parameters are updated very slowly).
Personally, the regularization view is the one that convinces me, and looking at other people's blogs, the term "consistency regularization" does exist, so it should be possible to view all these methods from the regularization perspective:
The Π-Model, Temporal Ensembling, and Mean Teacher all use consistency regularization for semi-supervised learning.
Source: https://blog.csdn.net/chanbo8205/article/details/108846097
Back to the main topic: the figure above also shows a simple application example of the method, a digit-recognition problem; its workflow is clearly described there, so I will not go into it.
In the experiments, the authors compare against several methods. They reproduce the Π-model as a baseline, then modify it to use weight-averaged consistency targets (the consistency cost in the figure above), denoted Π (ours). Here are the experiments and results:
It can be seen that when labels are scarce, Mean Teacher is much better, while in some settings Mean Teacher is not the best. It is worth noting that the original Π-model is better than Π (ours) when labels are incomplete, which does not seem to support the idea that adding weight-averaged consistency to the Π-model gives better results.
I will stop here and not go through the other experiments. If you see things differently, you are welcome to discuss in the comments.