Notes on Prof. Li Hongyi's 2020 Deep Learning Series, Part 5

2022-07-24 22:59:00 【ViviranZ】

I got lost watching this one... so at least I'm taking notes.

https://www.bilibili.com/video/BV1UE411G78S?from=search&

1. On-policy & off-policy

If the agent learns while playing the game itself, it is on-policy. If it learns by standing aside and watching someone else play, it is off-policy.

Put casually: is this the difference between gaining experience by fighting monsters yourself and first watching how others fight monsters to pick up their tricks?

The gradient-based update we used before (policy gradient) is on-policy: every step updates \theta, so we keep collecting new data with the updated policy as we move forward. We would like to use off-policy learning instead, because then each piece of data (states, actions, and the choices made) can be reused many times, whereas on-policy data is used for one update and then immediately discarded.

A trick: importance sampling (not specific to RL; it is used very widely).

Often, what we want to study is the true distribution p(x) of x, but the samples x' we can actually obtain come from a different distribution q(x). So we need to modify the formula for E(X); otherwise the result reflects q(x) and does not capture the characteristics of p(x).

The formula we use is to take the expectation of f(x)*p(x)/q(x) over samples drawn from q(x), that is, E_{x~p}[f(x)] = E_{x~q}[f(x)*p(x)/q(x)]. The key requirement is that any x with q(x)=0 must also have p(x)=0, which puts some restrictions on the choice of q(x).
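The reweighting above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the two normal distributions, the test function f(x) = x^2, and the helper names `normal_pdf` and `importance_sampling_mean` are all assumptions chosen for the example.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_mean(f, p_pdf, q_pdf, q_sampler, n, rng):
    """Estimate E_{x~p}[f(x)] from n samples drawn from q.

    Each sample is reweighted by p(x)/q(x), so q must be nonzero
    wherever p is nonzero.
    """
    total = 0.0
    for _ in range(n):
        x = q_sampler(rng)
        total += f(x) * p_pdf(x) / q_pdf(x)
    return total / n


rng = random.Random(0)

# Assumed toy setup: target p = N(0, 1), sampling distribution q = N(0.5, 1.5^2).
# The true value of E_{x~p}[x^2] is 1.
estimate = importance_sampling_mean(
    f=lambda x: x * x,
    p_pdf=lambda x: normal_pdf(x, 0.0, 1.0),
    q_pdf=lambda x: normal_pdf(x, 0.5, 1.5),
    q_sampler=lambda r: r.gauss(0.5, 1.5),
    n=100_000,
    rng=rng,
)
print(estimate)
```

Here q is deliberately wider than p, so the weights p(x)/q(x) stay bounded and the estimate settles close to the true value.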

Also, before explaining why we want importance sampling, some related background. In practice, p(x) and q(x) still must not differ too much: although the two expectations are equal, the variances (and other statistics) are not. So if there are too few samples, or the samples are not representative enough, the results are likely to differ a lot because of the variance.
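The variance gap can be written out explicitly. The following is the standard calculation using Var[X] = E[X^2] - (E[X])^2, in the same notation as the text:

```latex
\begin{align*}
\mathrm{Var}_{x\sim p}\bigl[f(x)\bigr]
  &= E_{x\sim p}\bigl[f(x)^2\bigr] - \bigl(E_{x\sim p}[f(x)]\bigr)^2 \\
\mathrm{Var}_{x\sim q}\Bigl[f(x)\frac{p(x)}{q(x)}\Bigr]
  &= E_{x\sim q}\Bigl[\Bigl(f(x)\frac{p(x)}{q(x)}\Bigr)^2\Bigr]
     - \Bigl(E_{x\sim q}\Bigl[f(x)\frac{p(x)}{q(x)}\Bigr]\Bigr)^2 \\
  &= E_{x\sim p}\Bigl[f(x)^2\,\frac{p(x)}{q(x)}\Bigr] - \bigl(E_{x\sim p}[f(x)]\bigr)^2
\end{align*}
```

The second terms are identical, but the first term of the reweighted estimator carries an extra factor p(x)/q(x). When p and q are far apart this factor can be very large, which is why the variance can blow up.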

A brief summary:

Now let's work through the derivation:

The main tools are the derivative and the conditional probability formula, so there is not much to say about them. The key is the ratio crossed out in red (the ratio of the state probabilities under \theta and \theta'). There are three ways to understand why it can be dropped:

1. The probability of a state occurring has little to do with the trajectory or the action, so it can be ignored.

2. In practice some states appear only once (for example, when the input is an image), so p_\theta(s) and p_\theta'(s) are hard to estimate. So we just convince ourselves that the ratio is not important = =

3. It can be understood as treating a and s as independent, so that the conditional probability and the joint probability are regarded as equivalent (essentially the same as 1.).
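For reference, the derivation these three points refer to can be reconstructed as follows. This is a sketch of the standard off-policy policy-gradient derivation; the symbols A^{\theta} (advantage), \pi_\theta (policy), and J^{\theta'}(\theta) (surrogate objective) are assumed notation, not taken from the notes above.

```latex
\begin{align*}
\nabla \bar{R}_\theta
  &= E_{(s_t,a_t)\sim\pi_\theta}\bigl[A^{\theta}(s_t,a_t)\,\nabla\log p_\theta(a_t\mid s_t)\bigr] \\
  &= E_{(s_t,a_t)\sim\pi_{\theta'}}\Bigl[\frac{p_\theta(s_t,a_t)}{p_{\theta'}(s_t,a_t)}\,
     A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t\mid s_t)\Bigr] \\
  &= E_{(s_t,a_t)\sim\pi_{\theta'}}\Bigl[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\,
     \frac{p_\theta(s_t)}{p_{\theta'}(s_t)}\,
     A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t\mid s_t)\Bigr]
\end{align*}
```

Dropping the state ratio p_\theta(s_t)/p_{\theta'}(s_t) and using \nabla f(x) = f(x)\nabla\log f(x) gives the objective whose gradient this is:

```latex
J^{\theta'}(\theta)
  = E_{(s_t,a_t)\sim\pi_{\theta'}}\Bigl[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\,
    A^{\theta'}(s_t,a_t)\Bigr]
```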

How do we make sure \theta and \theta' are similar enough? PPO!

Let's talk about PPO and its predecessor TRPO. PPO adds a term \beta KL(\theta,\theta'), similar to a regularization term in ML. TRPO instead turns it into a constraint. The results are similar, but TRPO is much harder to apply in practice.
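Written out, the two formulations look roughly like this. J^{\theta'}(\theta) denotes the importance-weighted objective E_{(s_t,a_t)\sim\pi_{\theta'}}[(p_\theta(a_t|s_t)/p_{\theta'}(a_t|s_t)) A^{\theta'}(s_t,a_t)], and the threshold \delta is an assumed symbol for the constraint bound:

```latex
\begin{align*}
\text{PPO:}\quad
  & J^{\theta'}_{PPO}(\theta) = J^{\theta'}(\theta) - \beta\,KL(\theta,\theta') \\
\text{TRPO:}\quad
  & \max_\theta\; J^{\theta'}(\theta)
    \quad\text{subject to}\quad KL(\theta,\theta') < \delta
\end{align*}
```

The penalty form can be optimized with ordinary gradient ascent, while the constrained form needs a constrained optimizer, which is one way to see why TRPO is harder to use.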

Note that KL here is not a distance between the parameters but a distance between the results (the outputs the two policies produce).
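A small sketch may help make this distinction concrete. It assumes a toy softmax policy over three actions in a single state, with the logits themselves as the parameters; the helper names and the numbers are illustrative, not from the lecture. In PPO the KL term is averaged over many states; only one state is shown here.

```python
import math


def softmax(logits):
    """Turn a list of logits into a categorical action distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def parameter_distance(theta_a, theta_b):
    """Euclidean distance between two parameter vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_a, theta_b)))


# Assumed toy policy: the parameters are the logits for one state.
theta_old = [1.0, 2.0, 3.0]
theta_shifted = [11.0, 12.0, 13.0]  # far away in parameter space, same behavior
theta_tilted = [1.0, 2.0, 4.0]      # close in parameter space, different behavior

kl_shifted = kl_divergence(softmax(theta_old), softmax(theta_shifted))
kl_tilted = kl_divergence(softmax(theta_old), softmax(theta_tilted))

print(parameter_distance(theta_old, theta_shifted), kl_shifted)
print(parameter_distance(theta_old, theta_tilted), kl_tilted)
```

Adding a constant to every logit leaves the softmax output unchanged, so the shifted parameters are far from the old ones yet give essentially zero KL, while the much smaller change to one logit gives a clearly nonzero KL.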

The specific algorithm is as follows ：

clip: a truncation function. It clips the ratio that appears in the first term so that it stays between 1-\epsilon and 1+\epsilon.
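For reference, the clipped objective this describes is usually written as below (the PPO2 form). The advantage symbol A^{\theta^k} and the summation over sampled state-action pairs are assumed notation:

```latex
J^{\theta^k}_{PPO2}(\theta) \approx \sum_{(s_t,a_t)}
  \min\Bigl(
    \frac{p_\theta(a_t\mid s_t)}{p_{\theta^k}(a_t\mid s_t)}\,A^{\theta^k}(s_t,a_t),\;
    \mathrm{clip}\Bigl(\frac{p_\theta(a_t\mid s_t)}{p_{\theta^k}(a_t\mid s_t)},\,1-\epsilon,\,1+\epsilon\Bigr)\,A^{\theta^k}(s_t,a_t)
  \Bigr)
```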

When A>0, we want to increase p_\theta / p_{\theta^k} as much as possible, but, as the red line shows, only up to 1+\epsilon. When A<0, we want to push p_\theta / p_{\theta^k} down as much as possible, but, as the red line shows, only down to 1-\epsilon.
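The two cases can be checked numerically with a small sketch of the per-sample clipped term, min(ratio * A, clip(ratio, 1-\epsilon, 1+\epsilon) * A). The function names and the default \epsilon = 0.2 are assumptions for illustration, and the objective is averaged over samples here rather than summed.

```python
def clip(value, low, high):
    """Truncate value to the interval [low, high]."""
    return max(low, min(value, high))


def clipped_term(ratio, advantage, epsilon=0.2):
    """Clipped surrogate for one sample: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return min(unclipped, clipped)


def ppo_clip_objective(new_probs, old_probs, advantages, epsilon=0.2):
    """Average clipped surrogate over sampled (state, action) pairs.

    new_probs[i] is p_theta(a_i | s_i) and old_probs[i] is p_theta_k(a_i | s_i).
    """
    terms = [
        clipped_term(p_new / p_old, adv, epsilon)
        for p_new, p_old, adv in zip(new_probs, old_probs, advantages)
    ]
    return sum(terms) / len(terms)


# A > 0: raising the ratio past 1 + epsilon earns nothing extra.
print(clipped_term(1.5, 2.0))   # capped at (1 + 0.2) * 2.0
# A < 0: lowering the ratio past 1 - epsilon earns nothing extra.
print(clipped_term(0.5, -2.0))  # capped at (1 - 0.2) * -2.0
# Moving the ratio in the harmful direction is not clipped.
print(clipped_term(0.5, 2.0))
print(clipped_term(1.5, -2.0))
```

Because of the outer min, the cap only applies in the direction that would improve the objective, so there is no incentive to move p_\theta far away from p_{\theta^k}.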

Finally, a comparison of the results of PPO with those of other methods:

Copyright notice
This article was written by [ViviranZ]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/202/202207201609246046.html
