Notes on Teacher Li Hongyi's 2020 Deep Learning Series 6

2022-07-24 22:59:00 【ViviranZ】

I was completely lost watching this one... so at least I'll take notes.

https://www.bilibili.com/video/BV1UE411G78S?from=search&amp;

Q-learning:

First, a review of the critic: it is responsible for scoring an actor. When the actor is in some state, the critic estimates the expected cumulative reward it will obtain from then on. Note that the critic's score is tied to the actor (policy \pi): for the same state, the critic gives different expected rewards for different actors.

A quick review of how the critic is estimated:

1. MC method:

It tallies the accumulated reward of each episode, so you must play until the end of the game before you can update the network.

2. TD method:

There is no need to play to the end. As soon as the agent moves from s_t to s_{t+1}, you can update.

The difference between them:

Because MC runs until the end of the game and every step adds variance, the variance accumulates over many steps and ends up very large.

TD only looks one step ahead, so its variance is small. However, its target relies on the current estimate of the value at s_{t+1}, which may be inaccurate.
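As a rough sketch of the difference, the two methods fit the same critic V^\pi to different targets. The symbols G_t (accumulated reward from step t to the end of the episode at step T) and r_t (reward for the step from s_t to s_{t+1}) are notation assumed here for illustration, and no discount factor is used:

```latex
\text{MC:}\quad V^\pi(s_t) \leftarrow G_t = \sum_{k=t}^{T} r_k
\qquad\qquad
\text{TD:}\quad V^\pi(s_t) \leftarrow r_t + V^\pi(s_{t+1})
```

The MC target sums many random rewards, while the TD target contains only one random reward plus an estimate.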

An example:

Take s_a as the state to be evaluated. Under MC, s_a was followed by s_b and the episode ended with reward 0, so the value of s_a should be 0. TD instead says that what happened after s_b was just a coincidence: s_b yields 0 only 1/4 of the time, and the expected value of s_b is 3/4, so 3/4 is also the value that s_a deserves.
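The two answers can be reproduced with a small sketch. The episode data below is hypothetical, chosen only to be consistent with the 1/4 and 3/4 figures above (one episode through s_a then s_b with reward 0, six episodes of s_b with reward 1, one of s_b with reward 0); the function names are also made up for illustration.

```python
from fractions import Fraction

# Hypothetical data: eight episodes, each a list of (state, reward) steps.
episodes = (
    [[("s_a", 0), ("s_b", 0)]]   # s_a, then s_b, total reward 0
    + [[("s_b", 1)]] * 6         # s_b with reward 1, six times
    + [[("s_b", 0)]]             # s_b with reward 0, once
)

def mc_values(episodes):
    """Monte Carlo: average the accumulated reward observed after each state."""
    returns = {}
    for episode in episodes:
        rewards = [r for _, r in episode]
        for i, (state, _) in enumerate(episode):
            returns.setdefault(state, []).append(sum(rewards[i:]))
    return {s: Fraction(sum(g), len(g)) for s, g in returns.items()}

def td_value(episodes, state, v):
    """One-step TD: average of r + V(next state) over visits (0 after the last step)."""
    targets = []
    for episode in episodes:
        for i, (s, r) in enumerate(episode):
            if s == state:
                next_v = v[episode[i + 1][0]] if i + 1 < len(episode) else Fraction(0)
                targets.append(Fraction(r) + next_v)
    return sum(targets) / len(targets)

mc = mc_values(episodes)
print("MC:", mc["s_a"], mc["s_b"])
print("TD value of s_a:", td_value(episodes, "s_a", mc))
```

Both methods agree on s_b; they disagree only on s_a, because TD reuses the estimate for s_b instead of the single outcome that followed s_a.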

Besides the state-value critic estimated with MC or TD, there is another kind of critic, Q^\pi(s,a). It means: in state s, the agent is forced to take action a, and everything after that is left to the agent following \pi. There is also a discrete form (one output per action), but it only applies when the number of actions is finite.

An example:

Given the Q function of any policy \pi, you can always find a policy \pi' that is at least as good as \pi. Then estimate Q for \pi', improve again, and so on; the policy keeps getting better.

What does "better" mean?

It means that V^{\pi'}(s) \ge V^\pi(s) for every state s. To build \pi': in state s, consider all possible actions a and pick the one with the largest Q^\pi(s,a); that action is defined as \pi'(s), that is, \pi'(s) = \arg\max_a Q^\pi(s,a).

Next, prove a proposition: even if the action given by \pi'(s) is taken at just one state and policy \pi is followed everywhere else, the resulting Q value does not decrease, so \pi' is better!
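The argument can be sketched as a chain of inequalities. The notation is assumed for illustration: V^\pi(s) = Q^\pi(s, \pi(s)), \pi'(s) = \arg\max_a Q^\pi(s,a), r_t is the reward received after acting in s_t, and no discount factor is used:

```latex
\begin{aligned}
V^\pi(s) &= Q^\pi(s, \pi(s)) \le \max_a Q^\pi(s, a) = Q^\pi(s, \pi'(s)) \\
         &= \mathbb{E}\left[ r_t + V^\pi(s_{t+1}) \mid s_t = s,\ a_t = \pi'(s_t) \right] \\
         &\le \mathbb{E}\left[ r_t + Q^\pi(s_{t+1}, \pi'(s_{t+1})) \mid s_t = s,\ a_t = \pi'(s_t) \right] \\
         &= \mathbb{E}\left[ r_t + r_{t+1} + V^\pi(s_{t+2}) \mid s_t = s,\ a_t = \pi'(s_t),\ a_{t+1} = \pi'(s_{t+1}) \right] \\
         &\le \dots \le V^{\pi'}(s)
\end{aligned}
```

The first line changes the action at only one state; applying the same step at every later state replaces \pi with \pi' everywhere.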

Here are a few tips:

The first is the target network. Two networks are used in training. One usually stays fixed (the target network) and only supplies the target value; the other is updated over and over so that its output gets as close to that target as possible. After a number of updates, its parameters are copied into the target network, and the whole process repeats.... (such bad luck)

The second is exploration. If the agent has never taken action a in state s, the Q value for that pair cannot be computed from experience and can only be estimated. If Q is a network this is less of a problem (how is Q a network? I don't get it), but in general it is a problem. The agent may find an action that works well and keep repeating it, even though some other action might be better... So we have the \epsilon-greedy algorithm and a second method ~
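A minimal sketch of \epsilon-greedy action selection, assuming the Q values for the current state are given as a plain list indexed by action (the function name and the `rng` parameter are made up for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the highest-Q action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# With epsilon = 0 the choice is purely greedy.
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))
```

A common choice, not stated in these notes, is to start with a larger epsilon and shrink it as training goes on.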

Third: storage (the replay buffer).

We want the buffer to hold data that is as diverse as possible. Much of what is stored in the buffer was collected by earlier policies \pi', which increases diversity. It is worth noting: does using data from \pi' to estimate Q for \pi cause problems? The answer is no (is the reason left as an exercise?)
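A minimal sketch of such a buffer, under some assumptions: each entry is a tuple (s, a, r, s_next), and the oldest entry is the one dropped when the buffer is full. The class and method names are hypothetical.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions collected by current and earlier policies."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self.data = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Draw a batch without replacement; never more than what is stored.
        return rng.sample(list(self.data), min(batch_size, len(self.data)))

    def __len__(self):
        return len(self.data)

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.store(i, 0, 0.0, i + 1)
print(len(buf), buf.sample(2))
```

Because sampling is uniform over everything stored, one batch can mix transitions from several different policies.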

To summarize, the overall Q-learning algorithm:

Note:

1. store: if the buffer is full, throw one entry out.

2. sample: what is sampled is a batch, not a single transition (the notation may not make this clear).
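The pieces above (target network, \epsilon-greedy, buffer with batch sampling) might fit together roughly as follows. This is a sketch under stated assumptions, not the lecture's exact pseudocode: Q is a table rather than a network, the environment is a made-up five-state chain, a discount factor `gamma` and a learning rate `lr` are introduced, ties between equal Q values are broken at random, and the buffer drops its oldest entry when full.

```python
import copy
import random
from collections import deque

N_STATES, N_ACTIONS, GOAL = 5, 2, 4   # hypothetical chain: states 0..4, goal at state 4

def step(state, action):
    """Toy environment: action 1 moves right, action 0 moves left; reward 1 at the goal."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy with random tie-breaking among equally good actions."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])

def train(episodes=500, epsilon=0.2, gamma=0.9, lr=0.5,
          capacity=200, batch_size=8, sync_every=20, max_steps=50, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # the Q being learned
    q_target = copy.deepcopy(q)                         # fixed copy that supplies targets
    buffer = deque(maxlen=capacity)                     # when full, the oldest entry is dropped
    total_steps = 0
    for _episode in range(episodes):
        state = 0
        for _step in range(max_steps):
            action = choose_action(q[state], epsilon, rng)
            nxt, reward, done = step(state, action)
            buffer.append((state, action, reward, nxt, done))                 # store
            batch = rng.sample(list(buffer), min(batch_size, len(buffer)))    # sample a batch
            for s, a, r, s2, d in batch:
                target = r if d else r + gamma * max(q_target[s2])
                q[s][a] += lr * (target - q[s][a])      # move Q toward the target
            total_steps += 1
            if total_steps % sync_every == 0:
                q_target = copy.deepcopy(q)             # refresh the target copy
            state = nxt
            if done:
                break
    return q

q = train()
greedy = [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(GOAL)]
print(greedy)
```

Here `greedy` lists the highest-Q action for each non-goal state; on this chain, moving right (action 1) is the action that leads to the reward.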

!! The key is to keep this clear: the goal of Q-learning, and of RL here, is to find the best Q !!!!

How should the other network explore when the action space is continuous ("explore" as in exploration)????

The Q-learning method has some problems, which is why there is Double DQN... (to be continued)

Copyright notice: this article was written by [ViviranZ]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/202/202207201609245975.html
