Notes on Teacher Li Hongyi's 2020 Deep Learning Series, Lecture 4
2022-07-24 22:59:00 【ViviranZ】
Watched it in a bit of a daze... at least I'll take some notes.
https://www.bilibili.com/video/BV1UE411G78S?from=search&
Finally we get to PPO, hahaha. Super fun, super fun.
First, the basic elements: as before, we have the familiar actor, environment, and reward function, plus the policy.

Next, the process: observe s_1 → take action a_1 → obtain reward r_1 → observe the new state s_2 → ……

In general, s_2 depends on both s_1 and a_1, and it is usually a distribution rather than a fixed value (a game where each action on a given screen always produced the same result would be too boring!).

The reward is also stochastic. So instead of working with a single reward, we compute the *expected* reward: the average of R(τ) over many sampled trajectories.
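The "average over many trajectories" idea can be sketched in a few lines. This is just an illustrative Monte Carlo estimate with a made-up toy `policy`/`env_step` interface (both are my own assumptions, not anything from the lecture):

```python
def sample_return(policy, env_step, s0, horizon):
    """Roll out one trajectory tau and return its total reward R(tau)."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)          # actor picks an action given the current state
        s, r = env_step(s, a)  # environment returns the next state and a reward
        total += r
    return total

def expected_return(policy, env_step, s0, horizon=10, n=1000):
    """Monte Carlo estimate of E[R(tau)]: average over n sampled trajectories."""
    return sum(sample_return(policy, env_step, s0, horizon) for _ in range(n)) / n
```

With a stochastic environment the estimate converges to the true expectation as n grows; with a deterministic toy environment every trajectory is identical and the average equals the single-trajectory return.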

The specific method (the formula derivation was covered in the second note):

Implementation idea (review):

What we actually use to estimate the gradient is the sampled result.
Tips:
1. Baseline: because the reward is never negative, even a not-so-good action may have its probability increased, which is not great. So we subtract a baseline b: only trajectories whose return is larger than b have their actions' probabilities increased, while those smaller than b have them decreased (because the weight becomes a negative number).
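A minimal sketch of the baseline trick (the function name and mean-return baseline are my own choices, not the lecture's code): subtracting a baseline from each trajectory's return makes below-baseline trajectories get a negative weight, so gradient ascent lowers the probability of their actions.

```python
def pg_weights(returns, baseline=None):
    """Per-trajectory weight (R(tau) - b) for the policy gradient.

    With all-positive rewards, every sampled action's probability would be
    pushed up; subtracting a baseline b (here the mean return by default)
    flips the weight's sign for below-average trajectories."""
    if baseline is None:
        baseline = sum(returns) / len(returns)
    return [r - baseline for r in returns]
```

For returns [1, 3] the mean baseline is 2, so the weights are [-1, 1]: the first trajectory's actions are made less likely, the second's more likely.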

2. There can be a "losing team's MVP" phenomenon: a trajectory's total reward may be poor even though one of its actions was good, and a trajectory's total reward may be high even though some of its actions were bad.
Solution: if every action in a trajectory shares the same weight (R(τ^n) − b), the averaging is unfair to individual actions. So for each action we use the sum of the rewards obtained *after* that step, rather than the whole trajectory's reward, as its weight.
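This "rewards after this step" weighting (commonly called reward-to-go) can be computed in one backward pass over the trajectory. A generic sketch, not the lecture's code:

```python
def reward_to_go(rewards):
    """Weight for the action at step t = sum of rewards from step t onward,
    instead of the whole trajectory's return for every step."""
    out, running = [], 0.0
    for r in reversed(rewards):  # accumulate from the end of the trajectory
        running += r
        out.append(running)
    return out[::-1]             # restore original time order
```

For per-step rewards [1, 2, 3], the weights are [6, 5, 3]: the last action is credited only with the reward it could still influence.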

We also introduce a discount factor γ, because:
1. The further a reward is from an action, the less it has to do with that action.
2. People prefer to receive rewards sooner rather than later.
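Combining the discount factor with reward-to-go: each later reward in the sum is scaled down by another factor of γ, so distant rewards contribute less to an action's weight. A sketch under the same toy assumptions as above:

```python
def discounted_reward_to_go(rewards, gamma=0.99):
    """Weight for the action at step t:
    r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    Rewards further from the action count less (point 1), and earlier
    rewards are worth more than later ones (point 2)."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running  # fold in the discounted future sum
        out.append(running)
    return out[::-1]
```

With γ = 0.5 and rewards [1, 1, 1], the weights are [1.75, 1.5, 1.0]; setting γ = 1 recovers the plain reward-to-go.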
