Notes on Teacher Li Hongyi's 2020 Deep Learning Series, Part 9
2022-07-24 22:59:00 【ViviranZ】
Watched until my head spun.... At least I should take some notes.
https://www.bilibili.com/video/BV1UE411G78S?from=search&
Finally, let's talk about A3C, la la la!
First, let's review Policy Gradient. Even after adding the discount factor and the baseline, this formula is still very unstable, because what happens after taking action a in state s is highly random, so the sampled return G has a large variance. If we could sample enough times, we could avoid extreme outcomes like the G = 100 in the example, but in practice we only sample a few times, so it is natural to ask whether we can use the expected value of G instead of the G obtained by sampling.
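For reference, the policy-gradient expression being discussed (with the discount factor and a baseline b) can be written as below. This is my reconstruction in the lecture's usual notation rather than a quote from the slide:

```latex
\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}
\Big(\underbrace{\sum_{t'=t}^{T_n} \gamma^{\,t'-t}\, r_{t'}^{n}}_{G_t^{n}} - b\Big)
\,\nabla \log p_\theta\!\left(a_t^{n}\mid s_t^{n}\right)
```

The term G_t^n in the underbrace is exactly the sampled return whose variance we want to get rid of.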

So let's also review how to obtain this "expectation": we have V and Q, and by their definitions they are indeed the expectations we want.
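As a reminder, written in standard RL notation (which may differ slightly from the slide's symbols), the two expectations are:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s \,\right], \qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s,\ a_t = a \,\right]
```

where G_t is the discounted cumulative reward from time t onward.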

From the formula below we can see that Q is indeed the sum above, and b, as said before, is exactly the expectation at that state (i.e., V). So the expression in the red box means "the expected reward obtained by taking a_t at s_t" minus "the expectation at s_t itself (the mean obtained by averaging over the outcomes of the different actions)", in other words the "benefit" of taking a_t at s_t. (Perhaps the state itself is already good; to avoid the "even a pig can fly if it stands in the right wind" effect, we want to exclude the influence of the "favorable wind"; see the previous notes for details.)
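That quantity in the red box is what is usually called the advantage; in standard notation (again my reconstruction of the slide's symbols):

```latex
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)
```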

But this method may have a problem: Q and V need two separate networks, each trained on its own, and every network carries some estimation error, so the error of the combined result, coming from two networks, can be large; keeping two networks is also more complicated. So we train only V, which works because Q can be expressed using r and V. Writing Q as an expectation is easy to understand, but here we approximate that expectation directly with a sampled result. Since V is itself already an expectation, the expectation E is in fact taken only over r; to put it plainly, we replace the expectation of r (the reward obtained by executing a in s) with the sampled value. You can also compare the blue-purple formula below with the orange one: the orange formula (as in the note on the previous slide) is "how much the expected reward of executing a_t at s_t exceeds the mean over the various actions at s_t", so it can be read as "the immediate reward of executing a_t at s_t, plus the expected reward after arriving at s_{t+1}, minus the value of s_t"... hmm, that still sounds a bit awkward. In other words, it is the difference between (the reward of this one step plus the average reward from the new place) and the average reward from the starting point.
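Concretely, the substitution described above is (reconstructed in standard notation, not copied from the slide):

```latex
Q^{\pi}(s_t, a_t) = \mathbb{E}\!\left[\, r_t + V^{\pi}(s_{t+1}) \,\right]
\;\approx\; r_t + V^{\pi}(s_{t+1})
\quad\Longrightarrow\quad
A(s_t, a_t) \approx r_t + V^{\pi}(s_{t+1}) - V^{\pi}(s_t)
```

Only the single-step reward r_t is still a sampled quantity, so the variance is much smaller than that of the full sampled return G, and only one network V needs to be trained.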
More specifically: the A3C paper found that this formulation works best. = =


Some tips about A2C:
1. The parameters of the actor and the critic can be shared (but not completely shared; the two networks are not exactly the same). 2. The outputs for the different actions should not differ too much, because we want the agent to explore widely. (A minimal sketch of both tips is given below.)
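Here is what tip 1 (shared layers) and tip 2 (keeping the action distribution from collapsing, via an entropy bonus) could look like in code. This is a minimal sketch assuming PyTorch; the layer sizes, names, and the exact loss weighting are my own illustrative choices, not the lecture's code:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Actor and critic share the first layers but keep separate heads (tip 1)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: V(s)

    def forward(self, obs):
        h = self.shared(obs)
        return self.policy_head(h), self.value_head(h)

def a2c_loss(logits, action, advantage, entropy_coef=0.01):
    dist = torch.distributions.Categorical(logits=logits)
    policy_loss = -(dist.log_prob(action) * advantage.detach()).mean()
    value_loss = advantage.pow(2).mean()      # regress V toward r + V(s')
    entropy_bonus = dist.entropy().mean()     # tip 2: keep exploration broad
    return policy_loss + 0.5 * value_loss - entropy_coef * entropy_bonus
```

Subtracting the entropy term in the loss means maximizing the policy's entropy, so no single action's probability gets pushed too far above the others.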

Here comes A3C! It is like Kakashi training together with N shadow clones!

The specific A3C process: every worker (each may occupy its own CPU) first copies the global parameters, then interacts with the environment to compute an update. To obtain diverse data, the starting point of each actor may be very different. Each actor then computes its gradient separately and (the upper triangle is a typo; it should be the lower triangle, i.e., the gradient symbol ∇) sends the gradient back to the central controller, which then overwrites the previous θ.
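A structure-only sketch of that worker loop, assuming Python threads stand in for separate CPUs and a toy objective stands in for a real environment; every name here is illustrative, not from the lecture or any RL library:

```python
import threading
import numpy as np

class GlobalParams:
    """Stands in for the central controller holding the shared theta."""
    def __init__(self, dim):
        self.theta = np.zeros(dim)
        self._lock = threading.Lock()

    def pull(self):
        with self._lock:                     # each worker first copies theta
            return self.theta.copy()

    def push(self, grad, lr=0.1):
        with self._lock:                     # the center applies the gradient,
            self.theta += lr * grad          # overwriting the previous theta

def worker(center, seed, n_steps=200):
    rng = np.random.default_rng(seed)        # different seeds -> diverse data
    theta = center.pull()                    # 1. copy parameters
    grad = np.zeros_like(theta)
    for _ in range(n_steps):                 # 2. "interact with the environment"
        s = rng.normal(size=theta.shape)     #    (toy objective -(theta - s)^2)
        grad += 2.0 * (s - theta) / n_steps  #    its gradient w.r.t. theta
    center.push(grad)                        # 3. send the gradient back

if __name__ == "__main__":
    center = GlobalParams(dim=4)
    threads = [threading.Thread(target=worker, args=(center, s)) for s in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("updated theta:", center.theta)
```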

Now let's talk about another method:

Unlike the previous actor-critic, which only tells you whether an action is good, this method also tells you which action is better.

The specific approach is as follows: the critic plays a role analogous to the generator or something... wait, what was it in GANs again....? But this step itself is still easy to understand.
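As I understand this method, the "telling you which action is better" part works by training the actor to output the action that maximizes the critic's Q. Below is a minimal sketch of that single update step, assuming PyTorch and a continuous action space; both are assumptions of mine, not a quote from the lecture:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def actor_update(states):
    """Push the actor toward actions the (fixed) critic scores highly."""
    actions = actor(states)
    q = critic(torch.cat([states, actions], dim=-1))
    loss = -q.mean()        # maximize Q(s, pi(s))  <=>  minimize -Q
    actor_opt.zero_grad()
    loss.backward()         # gradients flow through the critic into the actor
    actor_opt.step()        # only the actor's parameters are updated

actor_update(torch.randn(32, obs_dim))   # toy batch of states
```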

The algorithm:

More specifically, compare it with Q-learning:
find a Q; at state s, use Q plus a little exploration to pick an action a, then obtain r and store the transition in the buffer. At every step: sample, update, sample, update...
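A minimal sketch of that loop (exploration, store in buffer, update every step). The tabular Q and the epsilon-greedy exploration here are my own illustrative choices, not necessarily what the slide uses:

```python
import random
from collections import defaultdict, deque

Q = defaultdict(float)                 # Q[(s, a)] -> estimated return
buffer = deque(maxlen=10_000)          # replay buffer of transitions
epsilon, alpha, gamma = 0.1, 0.1, 0.99
actions = [0, 1]

def choose_action(s):
    """Use Q plus a little exploration (epsilon-greedy here)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def step_and_update(s, env_step):
    a = choose_action(s)
    s_next, r = env_step(s, a)          # interact with the environment
    buffer.append((s, a, r, s_next))    # store the transition in the buffer
    # update every step: sample one stored transition and apply a TD update
    s_b, a_b, r_b, s2_b = random.choice(buffer)
    target = r_b + gamma * max(Q[(s2_b, a)] for a in actions)
    Q[(s_b, a_b)] += alpha * (target - Q[(s_b, a_b)])
    return s_next

# toy environment: reward 1 when the action matches the state's parity
toy_env = lambda s, a: ((s + 1) % 4, 1.0 if a == s % 2 else 0.0)
state = 0
for _ in range(100):
    state = step_and_update(state, toy_env)
```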

Changed to the A3C-style version...... besides the target Q, there is also a target actor.
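A small sketch of what keeping both a target Q and a target actor might look like, assuming DDPG-style soft (Polyak) updates; the soft-update rule is my assumption, not something stated in the note:

```python
import copy
import torch.nn as nn

actor = nn.Linear(8, 2)                 # placeholder networks for illustration
critic = nn.Linear(10, 1)
target_actor = copy.deepcopy(actor)     # frozen copies used when computing targets
target_critic = copy.deepcopy(critic)

def soft_update(online, target, tau=0.005):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for p, tp in zip(online.parameters(), target.parameters()):
        tp.data.mul_(1 - tau).add_(tau * p.data)

soft_update(actor, target_actor)
soft_update(critic, target_critic)
```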
