[Morvan Reinforcement Learning] Video notes (II) 1. What is Q-learning?

2022-07-24 09:16:00 【Your sister Xuan】

[Morvan reinforcement learning videos] Notes

Section 4: What is Q-Learning?

4.1 Introduction to Q-learning

Our code of conduct: good behavior earns a reward, and bad behavior earns a punishment.
Suppose you are Xiao Ming, a university freshman. On the first day of class you do not yet know that daydreaming in class leads to failing the course. You sit in the front row with two choices: listen or daydream. You keep choosing to daydream, and the teacher fails you. Having learned this painful lesson, you listen carefully when you retake the course (an extreme case, of course).
Suppose we currently have a Q-table, as shown in the figure below:
[Figure: Q-table]
The Q-table lists, for each state ($s_1, s_2, \dots$), the "Q-value" of every action ($a_1, a_2, \dots$). A Q-value indicates the return expected from choosing that action in that state.
What is the Q-table for?
Assume the Q-table already exists and the initial state is $s_1$ (just home from school). We pick the action with the largest Q-value in the row for $s_1$, which is $a_2$ (do homework), and this moves us to state $s_2$ (doing homework). There we again consult the Q-table and pick $a_2$ (do homework), and so on, repeating the cycle.
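The lookup described above can be sketched in a few lines of Python. This is only an illustration: the Q-values below and the helper name `greedy_action` are assumptions made up for this example, not values taken from the figure.

```python
# Hypothetical Q-table: rows are states, columns are actions.
# The numbers are illustrative assumptions, not read from the original figure.
q_table = {
    "s1": {"a1": -2.0, "a2": 1.0},
    "s2": {"a1": -4.0, "a2": 2.0},
}


def greedy_action(q_table, state):
    """Return the action with the largest Q-value in the given state."""
    actions = q_table[state]
    return max(actions, key=actions.get)


# In s1 the largest Q-value belongs to a2, so a2 (do homework) is chosen.
print(greedy_action(q_table, "s1"))
```

Choosing the action this way every time is the purely greedy policy; it never tries anything the table currently rates poorly.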

4.2 Updating the Q-table

[Figure: Q-table update]
Continuing the process above: after choosing action $a_2$ in state $s_1$ according to the Q-table, we arrive in state $s_2$.
In the figure above, $\max Q(s_2)$ is an estimate made before the second action is actually taken: it is the largest Q-value available in state $s_2$ (here $Q(s_2, a_2)$). It is multiplied by $\gamma$, the discount factor, which expresses how much future value influences the present and is discussed in detail later. Adding $R$, the immediate reward for choosing $a_2$ in state $s_1$ (assume $R = 0$ here, since there is no reward until the homework is finished), gives the "real" value of $Q(s_1, a_2)$, namely $R + \gamma \max Q(s_2)$. The value already stored in the Q-table is the estimate, so the difference (the part that needs adjusting) = real Q-value - original Q-value.
Finally, updated Q-value = original Q-value + $\alpha \times$ difference, where $\alpha$ is the learning rate (it controls how fast learning proceeds). In one line: $Q(s_1, a_2) \leftarrow Q(s_1, a_2) + \alpha\,[R + \gamma \max Q(s_2) - Q(s_1, a_2)]$.
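As a worked example of this update rule, here is a minimal Python sketch. The Q-values, the settings $\alpha = 0.1$ and $\gamma = 0.9$, and the function name `q_update` are all assumptions chosen for illustration; only $R = 0$ comes from the text.

```python
# Hypothetical Q-table with made-up values (assumption for illustration).
q_table = {
    "s1": {"a1": -2.0, "a2": 1.0},
    "s2": {"a1": -4.0, "a2": 2.0},
}


def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q-value."""
    # "Real" Q-value: immediate reward plus discounted best estimate for the next state.
    target = reward + gamma * max(q_table[next_state].values())
    # Difference between the "real" value and the value currently in the table.
    error = target - q_table[state][action]
    # Move the stored value a fraction alpha of the way toward the target.
    q_table[state][action] += alpha * error
    return q_table[state][action]


# In s1 we chose a2, received R = 0, and arrived in s2.
new_q = q_update(q_table, "s1", "a2", reward=0.0, next_state="s2")
print(new_q)  # about 1.08 with the assumed numbers above
```

With these numbers the "real" value is $0 + 0.9 \times 2 = 1.8$, the difference is $1.8 - 1 = 0.8$, and the stored value moves from 1 to roughly $1 + 0.1 \times 0.8 = 1.08$.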

4.3 The Q-learning algorithm

[Figure: Q-learning algorithm]
Part of the "real" Q-value is itself estimated from values in the Q-table; the update is the process described above.
On line 5 of the algorithm, the action $a$ for state $s$ is chosen with the $\epsilon$-greedy method. For example, with $\epsilon = 0.9$ there is a 0.9 probability of choosing the action with the largest Q-value and a 0.1 probability of choosing an action at random. The purpose is to add some randomness so that actions are sampled broadly.
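The $\epsilon$-greedy choice can be sketched as follows. This follows the convention used in these notes, where $\epsilon$ is the probability of acting greedily; many textbooks define $\epsilon$ the other way round, as the probability of exploring. The function name `epsilon_greedy`, the `rng` parameter, and the choice to draw the random action uniformly from all actions (including the greedy one) are assumptions of this sketch.

```python
import random


def epsilon_greedy(q_values, epsilon=0.9, rng=random):
    """Pick an action from a dict mapping action -> Q-value.

    With probability epsilon, return the action with the largest Q-value;
    otherwise return an action drawn uniformly at random from all actions.
    """
    if rng.random() < epsilon:
        return max(q_values, key=q_values.get)
    return rng.choice(list(q_values))


# Hypothetical Q-values for one state (made up for illustration).
q_s1 = {"a1": -2.0, "a2": 1.0}
rng = random.Random(0)  # seeded generator so the run is repeatable
picks = [epsilon_greedy(q_s1, epsilon=0.9, rng=rng) for _ in range(10000)]
print(picks.count("a2") / len(picks))
```

Because the random branch can also land on the greedy action, the greedy action is chosen somewhat more often than 90% of the time when there are only two actions.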

4.4 Discount factor $\gamma$

[Figure: expanding the estimate of $Q(s_1)$]
As the figure above shows, the estimate of $Q(s_1)$ does not depend on $s_2$ alone. Expanding by the same rule, we find that it is also related to the subsequent states $s_3, s_4, \dots$, all of which contribute to the estimate of the real Q-value.
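The expansion can be written out explicitly. This is a sketch under assumptions not stated in the original: $r_k$ denotes the reward received on entering state $s_k$ (so $r_2$ plays the role of $R$ in section 4.2), and $Q(s_k)$ is shorthand for the largest Q-value in state $s_k$.

$$Q(s_1) = r_2 + \gamma Q(s_2) = r_2 + \gamma\,[r_3 + \gamma Q(s_3)] = r_2 + \gamma r_3 + \gamma^2 r_4 + \gamma^3 r_5 + \dots$$

Each step further into the future picks up one more factor of $\gamma$, which is why its value decides how far ahead the agent effectively looks.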
When $\gamma = 1$, future rewards are taken fully into account, with no discounting at all.
When $\gamma \in (0, 1)$, the larger the value, the more weight is given to the future; one could say the agent is more "far-sighted".
When $\gamma = 0$, the future is ignored entirely and only the immediate reward counts.
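The three cases can be compared numerically with a small Python sketch. The reward sequence and the function name `discounted_return` are assumptions made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the k-th future reward by gamma ** k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# Hypothetical sequence: a reward of 1 at each of four consecutive steps.
rewards = [1.0, 1.0, 1.0, 1.0]
for gamma in (0.0, 0.5, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With $\gamma = 0$ only the first reward counts (total 1.0), with $\gamma = 0.5$ later rewards count progressively less (total 1.875), and with $\gamma = 1$ all four count equally (total 4.0).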