Notes on Teacher Li Hongyi's 2020 Deep Learning Series (6)
2022-07-24 22:59:00 【ViviranZ】
Watched until my eyes glazed over... at least taking some notes.
https://www.bilibili.com/video/BV1UE411G78S?from=search&
Q-learning:
First, a review of the critic: a critic is responsible for scoring an actor. When the actor is in some state, the critic estimates the expected future reward. Note: the critic's score is bound to the actor (policy \pi); for the same state, different actors will get different expected rewards from the critic.
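In symbols (the standard definition; the note itself does not spell it out), the critic's score is the expected discounted return under \pi, assuming a discount factor \gamma:

V^\pi(s) = E_\pi[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s ]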

By the way, a review of how the critic is trained:
1. MC (Monte-Carlo) method:
Computes the accumulated reward of each whole episode, so you must play until the end of the game before you can update the network.

2. TD (temporal-difference) method:
No need to play until the end; as soon as the transition from s_t to s_{t+1} happens, you can update.

The distinction between them:
Because MC plays until the end of the game, every step contributes variance, and accumulating over many steps compounds it, so the variance is large.
TD only looks one step ahead, so the variance is small; however, because it bootstraps from the current value estimate, it can be inaccurate (biased).
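A minimal Python sketch of the two regression targets (my own illustration, not from the lecture; v_next is any current value estimate, gamma a discount factor):

# Monte-Carlo target: needs the rewards of a whole episode.
def mc_target(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):        # accumulate the discounted return G_t
        g = r + gamma * g
    return g                           # high variance: sums many noisy rewards

# TD target: needs only one transition s_t -> s_{t+1}.
def td_target(r_t, v_next, gamma=0.99):
    return r_t + gamma * v_next        # low variance, but biased if v_next is off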

An example:
When estimating V(s_a): in the data, the single episode containing s_a passes through s_b and ends with total reward 0, so MC concludes V(s_a) = 0. But TD says that getting 0 from s_b in that episode was just a coincidence: across all episodes, s_b yields 0 only 1/4 of the time, so the expected value of s_b is 3/4, and the value s_a deserves is V(s_a) = r + V(s_b) = 0 + 3/4 = 3/4.
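A tiny numeric check. The eight episodes below are my reconstruction from the 1/4 and 3/4 figures above (matching the lecture's example, with gamma = 1):

# Episode 1 visits s_a then s_b (both rewards 0); episodes 2-8 visit only s_b.
sb_rewards = [0, 1, 1, 1, 1, 1, 1, 0]      # 8 visits to s_b, six of them reward 1
v_sb = sum(sb_rewards) / len(sb_rewards)   # 6/8 = 3/4

v_sa_mc = 0.0            # MC: the only episode with s_a had total reward 0
v_sa_td = 0.0 + v_sb     # TD: V(s_a) = r + V(s_b) = 0 + 3/4

print(v_sa_mc, v_sa_td)  # 0.0 vs 0.75 -- the two methods answer differently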

Besides the V critics trained by MC and TD, there is another critic, Q^\pi(s,a): the expected return if, on seeing state s, the agent is forced to take action a, and from then on leaves everything to \pi. There is also a discrete variant that takes only s as input and outputs one Q value per action, but it applies only to a finite set of actions.
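A minimal PyTorch sketch of the discrete-action architecture: state in, one Q value per action out. All layer sizes and dimensions here are made-up placeholders:

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Takes a state vector, returns one Q value per discrete action."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one output head per action
        )
    def forward(self, s):
        return self.net(s)                  # shape: (batch, n_actions)

q_net = QNet()
print(q_net(torch.randn(1, 4)))             # Q(s, a) for every action at once
# The other architecture takes (s, a) jointly and outputs one scalar,
# which also works when actions are continuous.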

An example:

Given a Q function for any policy \pi, you can always find a policy \pi' that is better than \pi; then estimate Q for \pi', update to get a new \pi', and so on, and the policy keeps getting better.

What does "better" mean here?
Better means V^{\pi'}(s) \ge V^{\pi}(s) for every state s. The new policy is built by, in each state s, considering all possible actions a and picking the one with the largest Q value: \pi'(s) = \arg\max_a Q^\pi(s,a).

Next, a proposition is proved: if in some state you take the action given by \pi'(s), and afterwards simply keep following the original policy \pi, the resulting value is still no worse than V^\pi(s); iterating this argument shows the final value can only go up — \pi' really is better!
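The chain of inequalities behind this (the standard argument from the lecture; reconstructed here, since the note only states the claim):

V^\pi(s) = Q^\pi(s, \pi(s))
         \le \max_a Q^\pi(s, a) = Q^\pi(s, \pi'(s))
         = E[ r_t + V^\pi(s_{t+1}) \mid s_t = s, a_t = \pi'(s_t) ]
         \le E[ r_t + Q^\pi(s_{t+1}, \pi'(s_{t+1})) \mid s_t = s, a_t = \pi'(s_t) ]
         \le \dots \le V^{\pi'}(s)

Each step replaces one more action by \pi''s choice, and each replacement can only increase the value.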

Here are a few tips:
Like actor-critic, there are two networks here. One basically doesn't move (the target network): it just supplies the fixed regression target. The other updates like crazy, trying to fit that target as well as possible; once it has, its weights are copied into the target network, the target moves to the next step, and the crazy fitting starts all over again... (poor thing)
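A sketch of the fixed-target trick, reusing the QNet sketch above (variable names and the copy-every-C-steps schedule are my assumptions):

import copy
import torch
import torch.nn.functional as F

target_net = copy.deepcopy(q_net)    # frozen: only supplies the regression target
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def td_loss(s, a, r, s_next):
    q_sa = q_net(s).gather(1, a)                       # Q(s_t, a_t), trained net
    with torch.no_grad():                              # target net gets no gradient
        target = r + gamma * target_net(s_next).max(1, keepdim=True).values
    return F.mse_loss(q_sa, target)

# Every C updates, copy the trained weights into the frozen network:
#   target_net.load_state_dict(q_net.state_dict())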

The second tip: if the agent has never taken action a in state s, the corresponding Q value cannot be computed from data, only estimated. If Q is a network this is less of a problem (how exactly the Q-network handles it, I don't understand), but in general it is a problem: the agent may happen to do well with one action and keep taking it forever, while some other action might actually be better... Hence the \epsilon-greedy algorithm, and a second method ~
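A minimal sketch of \epsilon-greedy (epsilon is usually decayed over training; that schedule, and all names here, are my own illustration):

import random
import torch

def epsilon_greedy(q_net, s, n_actions, epsilon=0.1):
    """With probability epsilon act randomly (explore), else take argmax_a Q(s,a)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)          # exploration
    with torch.no_grad():
        return int(q_net(s).argmax(dim=1).item())   # exploitation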

Third: about storage (the replay buffer).
We want the data stored in the buffer to be as diverse as possible. Much of what the buffer holds was produced by earlier policies \pi' — that increases diversity. A point worth noting: does using data generated by \pi' to estimate \pi cause problems? The answer is no (the reason is left to think about?)
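A minimal replay-buffer sketch (capacity and batch size are arbitrary placeholders):

import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next) transitions, possibly produced by old policies."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)   # when full, the oldest item is dropped
    def store(self, transition):
        self.buf.append(transition)
    def sample(self, batch_size=32):
        return random.sample(self.buf, batch_size)   # a whole batch, not one piece
    def __len__(self):
        return len(self.buf)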

Summary of the overall Q-learning algorithm:
Note:
1. store: when the buffer is full, throw out the oldest entry.
2. sample: what is sampled is a batch, not a single transition (the notation may not make this clear).
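Tying the sketches above together, one compressed Q-learning iteration. The environment env with its reset/step API, num_steps, and all hyper-parameters are hypothetical placeholders:

import torch

buffer = ReplayBuffer()
s = env.reset()                                        # hypothetical environment
for step in range(num_steps):
    st = torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)
    a = epsilon_greedy(q_net, st, n_actions=2)
    s_next, r, done = env.step(a)                      # hypothetical step API
    buffer.store((s, a, r, s_next))                    # oldest entry dropped if full
    if len(buffer) >= 32:
        batch = buffer.sample(32)                      # a batch, not one transition
        states, actions, rewards, next_states = zip(*batch)
        loss = td_loss(torch.tensor(states, dtype=torch.float32),
                       torch.tensor(actions).unsqueeze(1),
                       torch.tensor(rewards, dtype=torch.float32).unsqueeze(1),
                       torch.tensor(next_states, dtype=torch.float32))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if step % 100 == 0:                                # periodically refresh target
        target_net.load_state_dict(q_net.state_dict())
    s = env.reset() if done else s_next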

!! The key thing to keep clear: in Q-learning, this flavor of RL, the goal is to find the best Q!!!!
If the action space is continuous, how do we find the maximizing action — explore with another network (exploration)????
The plain Q-learning method has some problems, hence Double DQN... (to be continued)
