Notes on Teacher Li Hongyi's 2020 Deep Learning Series, Lecture 4
2022-07-24 22:59:00 【ViviranZ】
Watched it in a bit of a daze... at least I'll take some notes.
https://www.bilibili.com/video/BV1UE411G78S?from=search&
Finally we get to PPO, hahaha. Super fun, super fun.
First, the basic elements: as before, we have the familiar actor, environment, and reward function, plus the policy.

Next, the process: observe s_1 → take action a_1 → obtain reward r_1 → observe the new state s_2 → ……

In general, s_2 depends on both s_1 and a_1, and it is usually a distribution rather than a fixed value (a game where each action on a given screen always produced the same result would be too boring!).

The reward is also stochastic. So instead of working with a single reward, we compute the *expected* reward: the average of R(τ) over many sampled trajectories.
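The "average over many trajectories" idea can be sketched in a few lines. This is just an illustrative Monte Carlo estimate with a made-up toy `policy`/`env_step` interface (both are my own assumptions, not anything from the lecture):

```python
def sample_return(policy, env_step, s0, horizon):
    """Roll out one trajectory tau and return its total reward R(tau)."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)          # actor picks an action given the current state
        s, r = env_step(s, a)  # environment returns the next state and a reward
        total += r
    return total

def expected_return(policy, env_step, s0, horizon=10, n=1000):
    """Monte Carlo estimate of E[R(tau)]: average over n sampled trajectories."""
    return sum(sample_return(policy, env_step, s0, horizon) for _ in range(n)) / n
```

With a stochastic environment the estimate converges to the true expectation as n grows; with a deterministic toy environment every trajectory is identical and the average equals the single-trajectory return.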

The specific method (the formula derivation was covered in the second note):

Implementation idea (review):

What we actually use to estimate the gradient is the sampled result.
Tips:
1. Baseline: because the reward is never negative, even a not-so-good action may have its probability increased, which is not great. So we subtract a baseline b: only trajectories whose return is larger than b have their actions' probabilities increased, while those smaller than b have them decreased (because the weight becomes a negative number).
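A minimal sketch of the baseline trick (the function name and mean-return baseline are my own choices, not the lecture's code): subtracting a baseline from each trajectory's return makes below-baseline trajectories get a negative weight, so gradient ascent lowers the probability of their actions.

```python
def pg_weights(returns, baseline=None):
    """Per-trajectory weight (R(tau) - b) for the policy gradient.

    With all-positive rewards, every sampled action's probability would be
    pushed up; subtracting a baseline b (here the mean return by default)
    flips the weight's sign for below-average trajectories."""
    if baseline is None:
        baseline = sum(returns) / len(returns)
    return [r - baseline for r in returns]
```

For returns [1, 3] the mean baseline is 2, so the weights are [-1, 1]: the first trajectory's actions are made less likely, the second's more likely.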

2. There can be a "losing team's MVP" phenomenon: a trajectory's total reward may be poor even though one of its actions was good, and a trajectory's total reward may be high even though some of its actions were bad.
Solution: if every action in a trajectory shares the same weight (R(τ^n) − b), the averaging is unfair to individual actions. So for each action we use the sum of the rewards obtained *after* that step, rather than the whole trajectory's reward, as its weight.
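This "rewards after this step" weighting (commonly called reward-to-go) can be computed in one backward pass over the trajectory. A generic sketch, not the lecture's code:

```python
def reward_to_go(rewards):
    """Weight for the action at step t = sum of rewards from step t onward,
    instead of the whole trajectory's return for every step."""
    out, running = [], 0.0
    for r in reversed(rewards):  # accumulate from the end of the trajectory
        running += r
        out.append(running)
    return out[::-1]             # restore original time order
```

For per-step rewards [1, 2, 3], the weights are [6, 5, 3]: the last action is credited only with the reward it could still influence.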

We also introduce a discount factor γ, because:
1. The further a reward is from an action, the less it has to do with that action.
2. People prefer to receive rewards sooner rather than later.
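Combining the discount factor with reward-to-go: each later reward in the sum is scaled down by another factor of γ, so distant rewards contribute less to an action's weight. A sketch under the same toy assumptions as above:

```python
def discounted_reward_to_go(rewards, gamma=0.99):
    """Weight for the action at step t:
    r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    Rewards further from the action count less (point 1), and earlier
    rewards are worth more than later ones (point 2)."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running  # fold in the discounted future sum
        out.append(running)
    return out[::-1]
```

With γ = 0.5 and rewards [1, 1, 1], the weights are [1.75, 1.5, 1.0]; setting γ = 1 recovers the plain reward-to-go.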
