Notes on Teacher Li Hongyi's 2020 Deep Learning Series 4
2022-07-24 22:59:00 【ViviranZ】
I got lost just watching... so at least I'll take notes.
https://www.bilibili.com/video/BV1UE411G78S?from=search&
Finally we get to PPO, hahaha. This lecture is super funny.
First, the basic elements: the familiar actor, environment, and reward function, plus the policy.

Next, the process: observe s_1 → take action a_1 → receive r_1 → observe the new s_2 → ……

In general, s_2 depends on both s_1 and a_1, and it follows a distribution rather than being a fixed value. (In a game, taking an action on a given screen does not always produce the same result. That would be too boring!)

The reward is not deterministic either. So what we compute is not a single reward but the expected reward (the average over many trajectories).
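Written out, the expected reward looks roughly like this. The symbols \bar{R}_\theta, p_\theta(\tau), T and N are notation assumed here for illustration (they do not appear in the notes above); R(\tau) is the total reward of one trajectory, matching the R(\tau^n) used later.

```latex
% Probability of a trajectory \tau = (s_1, a_1, \dots, s_T, a_T) under policy parameters \theta
p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} p_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)

% Expected reward, and its estimate from N sampled trajectories \tau^1, \dots, \tau^N
\bar{R}_\theta = \sum_{\tau} R(\tau)\, p_\theta(\tau)
               = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[ R(\tau) \big]
               \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)
```

The term p(s_{t+1} \mid s_t, a_t) is the environment's part (the next state is a distribution), and only p_\theta(a_t \mid s_t) is controlled by the actor.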

The specific method (the formula derivation is covered in the second note)

Implementation idea (review)

In practice, what we work with is the result of sampling.
Tips:
1. Baseline: the reward may never be negative, so even a not-so-good action would have its probability increased whenever it is sampled. So we subtract a baseline b: only actions whose reward is larger than the baseline get a higher probability, while those with a smaller reward (now multiplied by a negative number) get a lower probability.
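A minimal sketch of the baseline tip. It assumes the baseline b is simply the mean of the sampled returns, which is one common choice and not something these notes specify; the function name `baseline_weights` is hypothetical.

```python
def baseline_weights(returns):
    """Return R(tau^n) - b for each sampled trajectory.

    Assumption: b is the mean of the sampled returns.
    """
    b = sum(returns) / len(returns)
    return [r - b for r in returns]


# All three returns are positive, so without a baseline every
# sampled action would be pushed up.
returns = [1.0, 2.0, 6.0]
weights = baseline_weights(returns)
print(weights)  # [-2.0, -1.0, 3.0]
```

With the baseline, only the trajectory that did better than average keeps a positive weight; the other two now get negative weights, so their actions become less likely.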

2. There can be a "losing-team MVP" phenomenon: a trajectory's total reward may be poor even though one step's action was good, and a trajectory's total reward may be high even though it contains bad actions.
Solution: every trajectory contains different actions, so weighting all of them by the same (R(\tau^n)-b) is uniform but unfair. Instead, we weight each action by the sum of the rewards obtained after that step, not by the reward of the whole trajectory.
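As a rough illustration of that solution, here is a sketch that computes, for every step, the sum of rewards from that step onward. The function name `reward_to_go` and the sample rewards are made up for this example.

```python
def reward_to_go(rewards):
    """For each step t, sum the rewards from step t to the end of the trajectory."""
    out = []
    running = 0.0
    # Walk backwards so each step reuses the sum of the steps after it.
    for r in reversed(rewards):
        running += r
        out.append(running)
    out.reverse()
    return out


rewards = [5.0, 0.0, -2.0]    # hypothetical per-step rewards
print(sum(rewards))           # 3.0, the whole-trajectory reward
print(reward_to_go(rewards))  # [3.0, -2.0, -2.0]
```

Using the whole-trajectory reward, all three actions would share the weight 3.0. With per-step sums, the second and third actions get -2.0, since nothing good happened after them.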

We also add a discount factor (\gamma), for two reasons:
1. The farther a reward is from this action, the less it has to do with the action.
2. People prefer to get rewards as soon as possible.
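A small sketch of how the discount factor changes the per-step weights, assuming each later reward is multiplied by \gamma once per step of distance. The function name `discounted_reward_to_go` and the default value 0.99 are assumptions for illustration only.

```python
def discounted_reward_to_go(rewards, gamma=0.99):
    """For each step t, compute sum over t' >= t of gamma**(t' - t) * rewards[t']."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Backward recursion: G_t = r_t + gamma * G_{t+1}
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


print(discounted_reward_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

In the first step's weight, the reward two steps later only counts for 0.25, so distant rewards contribute less to the credit an action receives.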
