Notes on Li Hongyi's 2020 Deep Learning Series, Part 5
2022-07-24 22:59:00 【ViviranZ】
I got lost watching this one... so I'm at least taking notes.
https://www.bilibili.com/video/BV1UE411G78S?from=search&
1. On-policy & off-policy
If the agent learns while playing by itself, that is on-policy. If it learns by standing aside and watching someone else play, that is off-policy.
Put casually: is this the difference between gaining experience by fighting monsters yourself and first watching how others fight monsters to pick up their techniques?

The gradient update we used before is on-policy: every step updates \theta, so the policy that collects the data is the same one being updated as we go. We would like to use off-policy learning instead, because then each piece of data (the states and the actions chosen) can be reused many times, whereas on-policy data is used once and then immediately discarded.
A trick: importance sampling (not specific to RL; it is a very common technique).
Often the distribution p(x) we actually want to study differs from the distribution q(x) that we can sample x' from, so the original formula for the expectation has to be corrected (otherwise the result reflects q(x) and not the characteristics of p(x)).
The formula is E_{x~p}[f(x)] = E_{x~q}[f(x)*p(x)/q(x)]: we take the expectation of f(x)*p(x)/q(x) over samples drawn from q(x). The key requirement is that any x with q(x)=0 must also have p(x)=0, which places a restriction on the choice of q(x).
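The reweighting above can be checked numerically. The following is a minimal sketch, not taken from the lecture: the two normal distributions, the test function f(x) = x^2, and the helper names `normal_pdf` and `importance_sampling_estimate` are all illustrative assumptions.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def importance_sampling_estimate(f, p_pdf, q_pdf, q_samples):
    """Estimate E_{x~p}[f(x)] using only samples drawn from q."""
    weights = p_pdf(q_samples) / q_pdf(q_samples)  # importance weights p(x)/q(x)
    return np.mean(f(q_samples) * weights)

rng = np.random.default_rng(0)

# Assumed example: target p = N(1, 1), sampling distribution q = N(0, 2^2).
# q is wider than p, so q(x) > 0 wherever p(x) > 0.
p_pdf = lambda x: normal_pdf(x, 1.0, 1.0)
q_pdf = lambda x: normal_pdf(x, 0.0, 2.0)
q_samples = rng.normal(0.0, 2.0, size=200_000)

f = lambda x: x ** 2
estimate = importance_sampling_estimate(f, p_pdf, q_pdf, q_samples)
naive = np.mean(f(q_samples))  # no reweighting: this estimates E_q[f], not E_p[f]

print(estimate)  # close to E_p[x^2] = 1^2 + 1 = 2
print(naive)     # close to E_q[x^2] = 0^2 + 4 = 4
```

Without the weights, the average describes q(x) only, which is exactly the problem the correction is meant to solve.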

Before explaining why we want importance sampling, one related point. In practice p(x) and q(x) still must not differ too much: the two expectations are equal, but other statistics such as the variance are not. So if the samples are too few or unrepresentative, the two estimates can come out very different because of the variance.
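To make the variance remark concrete, here is a short worked comparison using the identity Var[X] = E[X^2] - (E[X])^2. It is a standard calculation written out as a sketch, with f, p, and q as in the text.

```latex
\begin{aligned}
\operatorname{Var}_{x\sim p}\big[f(x)\big]
  &= \mathbb{E}_{x\sim p}\big[f(x)^2\big] - \big(\mathbb{E}_{x\sim p}[f(x)]\big)^2 \\
\operatorname{Var}_{x\sim q}\!\Big[f(x)\tfrac{p(x)}{q(x)}\Big]
  &= \mathbb{E}_{x\sim q}\!\Big[\Big(f(x)\tfrac{p(x)}{q(x)}\Big)^{2}\Big]
     - \Big(\mathbb{E}_{x\sim q}\!\Big[f(x)\tfrac{p(x)}{q(x)}\Big]\Big)^{2} \\
  &= \mathbb{E}_{x\sim p}\!\Big[f(x)^2\,\tfrac{p(x)}{q(x)}\Big]
     - \big(\mathbb{E}_{x\sim p}[f(x)]\big)^2
\end{aligned}
```

The second terms are identical, but the first term of the reweighted estimator carries an extra factor p(x)/q(x). When p and q are far apart this factor can be very large, so the variance grows and many more samples are needed.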

A small summary :

Let's derive the formula again:

The derivation mainly uses differentiation and the conditional probability formula, so I won't go through it. The key point is the ratio of state probabilities that is crossed out in red. There are three ways to understand why it can be dropped:
1. The probability of a state occurring has little to do with the trajectory and the action, so it can be ignored.
2. In practice some states appear only once (image inputs, for example), so p_\theta(s) and p_\theta'(s) are hard to estimate. We just talk ourselves into believing the ratio is unimportant = =
3. It can be read as a and s being independent, so the conditional probability can stand in for the joint probability (essentially the same as 1).
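For reference, the steps described above can be written out as follows. This is a reconstruction of the standard derivation under assumed notation, not a copy of the slide: A^{\theta'} denotes the advantage estimated from data collected with \theta', and J^{\theta'}(\theta) is the resulting surrogate objective.

```latex
\begin{aligned}
\nabla \bar{R}_\theta
  &= \mathbb{E}_{(s_t,a_t)\sim\pi_\theta}\big[A^{\theta}(s_t,a_t)\,\nabla\log p_\theta(a_t\mid s_t)\big] \\
  &= \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\!\Big[\frac{p_\theta(s_t,a_t)}{p_{\theta'}(s_t,a_t)}\,A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t\mid s_t)\Big] \\
  &= \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\!\Big[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\,\frac{p_\theta(s_t)}{p_{\theta'}(s_t)}\,A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t\mid s_t)\Big] \\
  &\approx \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\!\Big[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\,A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t\mid s_t)\Big]
\end{aligned}
```

Using \nabla f(x) = f(x)\nabla\log f(x), the last line is the gradient of

```latex
J^{\theta'}(\theta)
  = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\!\Big[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\,A^{\theta'}(s_t,a_t)\Big]
```

The approximation in the fourth line is where the state-probability ratio is dropped, for the reasons listed above.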
How do we make sure \theta and \theta' are similar enough? PPO!
Here we cover PPO and its predecessor TRPO. PPO adds a term \beta KL(\theta,\theta') to the objective, similar to regularization in ML. TRPO turns the same quantity into a constraint instead. The results are similar, but TRPO is much harder to use in practice.
Note that KL here is not a distance between the parameters but a distance between the results (the outputs the two policies produce).
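A minimal sketch of the penalized objective, to show that the KL term is computed on the action distributions the two policies output rather than on \theta itself. The toy probabilities, the function names `mean_kl` and `penalized_objective`, the direction KL(old || new), and beta = 1.0 are all assumptions made for illustration.

```python
import numpy as np

def mean_kl(old_probs, new_probs):
    """Average KL(old || new) over states; each row is an action distribution.

    Assumes every probability is strictly positive.
    """
    return np.mean(np.sum(old_probs * np.log(old_probs / new_probs), axis=1))

def penalized_objective(ratio, advantage, kl, beta):
    """Surrogate objective minus the KL penalty: mean(ratio * A) - beta * KL."""
    return np.mean(ratio * advantage) - beta * kl

# Toy data (hypothetical): 2 states, 2 actions.
old_probs = np.array([[0.5, 0.5], [0.9, 0.1]])  # output of the policy with theta'
new_probs = np.array([[0.6, 0.4], [0.8, 0.2]])  # output of the policy with theta
actions = np.array([0, 1])                      # action taken in each state
advantage = np.array([1.0, -0.5])               # advantage of each (state, action)

idx = np.arange(len(actions))
ratio = new_probs[idx, actions] / old_probs[idx, actions]  # p_theta / p_theta'
kl = mean_kl(old_probs, new_probs)
objective = penalized_objective(ratio, advantage, kl, beta=1.0)
print(ratio, kl, objective)
```

Two parameter vectors that are far apart can still produce nearly identical action distributions, and the reverse also holds, which is why the penalty is measured on the outputs.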

The specific algorithm is as follows :


clip: a truncation function that clamps the ratio in the first term to the range between 1-\epsilon and 1+\epsilon.
When A>0, we want to increase p_\theta / p_\theta^k as much as possible, but only up to the red line at 1+\epsilon. When A<0, we want to push p_\theta / p_\theta^k down as much as possible, but only down to the red line at 1-\epsilon.
Finally, a comparison of PPO's results with those of other methods:
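The clipping behaviour can be sketched in a few lines. This assumes the usual form of the clipped objective, min(ratio * A, clip(ratio, 1-\epsilon, 1+\epsilon) * A) averaged over samples; the function names, the sample values, and epsilon = 0.2 are illustrative assumptions.

```python
import numpy as np

def ppo_clip_terms(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return np.minimum(ratio * advantage, clipped_ratio * advantage)

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective: the mean of the per-sample terms."""
    return np.mean(ppo_clip_terms(ratio, advantage, epsilon))

# ratio = p_theta / p_theta^k for each sampled (state, action) pair
ratio = np.array([1.5, 0.5, 1.0, 0.5, 1.5])
advantage = np.array([1.0, -1.0, 2.0, 1.0, -1.0])

terms = ppo_clip_terms(ratio, advantage)
objective = ppo_clip_objective(ratio, advantage)
print(terms)      # [ 1.2 -0.8  2.   0.5 -1.5]
print(objective)  # about 0.28
```

The first two samples show the cap: with A>0 the ratio 1.5 is cut to 1.2, and with A<0 the ratio 0.5 is raised to 0.8, so there is no further gain from moving past either red line. The last two samples have ratios that moved in the unfavourable direction, and those terms are left unclipped.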
