[Morvan (莫烦) Reinforcement Learning] Video Notes (II) 1. What is Q-Learning?
2022-07-24 09:16:00 【Your sister Xuan】
【Morvan (莫烦) Reinforcement Learning Videos】 Notes
Section 4: What is Q-Learning?
4.1 Q-Learning Overview
Our rule of conduct: good behavior earns a reward, bad behavior earns a punishment.
Take Xiao Ming, a university freshman. On his first day of class he does not yet know that daydreaming in class leads to failing. Sitting in the first row, he has two choices: listen attentively or let his mind wander. He keeps letting his mind wander, and the teacher fails him. Having learned this painful lesson, he listens carefully when he retakes the course (an extreme example, of course).
Suppose we currently have a Q table, as shown in the figure below:
The Q table records, for each state ($s_1, s_2, \dots$) and each of the corresponding actions ($a_1, a_2, \dots$), a "Q value": an estimate of the return obtained by choosing that action in the current state.
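The original figure is not reproduced here; the table below is a small illustrative stand-in with made-up values (two states, two actions), matching the example that follows:

|       | $a_1$ | $a_2$ |
| ----- | ----- | ----- |
| $s_1$ | -2    | 1     |
| $s_2$ | -4    | 2     |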
What is the Q table for?
Suppose the Q table already exists, and the initial state is $s_1$ (just got home from school). We choose the action with the largest Q value in the $s_1$ row of the table, namely $a_2$ (do homework), which automatically moves us to state $s_2$ (doing homework). Based on the Q table we again choose action $a_2$ (keep doing homework)... and so on, back and forth.
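Here is a minimal Python sketch of this lookup-and-act step, using the illustrative table above (the state names, action names, and values are all assumptions for demonstration, not from the original):

```python
# The illustrative Q table as a dict: state -> {action: Q value}.
# All numbers are made-up examples.
q_table = {
    "s1": {"a1": -2.0, "a2": 1.0},  # s1: just got home from school
    "s2": {"a1": -4.0, "a2": 2.0},  # s2: doing homework
}

def greedy_action(state):
    """Return the action with the largest Q value in the given state."""
    actions = q_table[state]
    return max(actions, key=actions.get)

state = "s1"
action = greedy_action(state)   # -> "a2" (do homework)
print(state, "->", action)
```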
4.2 Updating the Q Table
Continuing the process above: after choosing action $a_2$ from the Q table, we arrive at state $s_2$.
In the figure above, $\max Q(s_2)$ is an estimate made before the second action is actually taken: the largest Q value attainable in state $s_2$, i.e. $Q(s_2, a_2)$. It is multiplied by $\gamma$, called the decay factor, which expresses how much future value influences the present; it will be covered in detail later. At the end we add $R$, the immediate reward for choosing action $a_2$ in state $s_1$ (assume $R = 0$ here: homework that is not yet finished earns no reward). This gives the "real" value of $Q(s_1, a_2)$. The value originally stored in the Q table is only an estimate, and the difference (i.e. the part that needs adjusting) = real Q value − old Q value.
Finally, updated Q value = old Q value + $\alpha \times$ difference, where $\alpha$ is the learning rate (it affects how fast learning proceeds).
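Written out, the update reconstructed from the description above is the standard tabular Q-learning rule:

$$
Q(s_1, a_2) \leftarrow Q(s_1, a_2) + \alpha \Big[ R + \gamma \max_{a} Q(s_2, a) - Q(s_1, a_2) \Big]
$$

where $R + \gamma \max_a Q(s_2, a)$ is the "real" Q value and the whole bracketed term is the difference.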
4.3 The Q-Learning Algorithm
In fact, part of the "real" Q value is itself estimated using the values currently in the Q table; the update procedure is the one described above.
In line 5 of the algorithm, when choosing action $a$ in state $s$, the $\epsilon$-greedy method is used. For example, with $\epsilon = 0.9$, there is a probability of 0.9 of choosing the action with the largest Q value, and a probability of 0.1 of choosing one of the other actions at random. The purpose is to add some randomness, following the principle of sampling widely.
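Below is a minimal runnable sketch of the whole algorithm on a toy problem. The 1-D corridor environment, the hyper-parameter values, and all names are illustrative assumptions, not from the original post; note that it follows this post's convention, where $\epsilon$ is the probability of acting greedily:

```python
import numpy as np

# Toy environment (an assumption for illustration): a 1-D corridor of 6
# cells. The agent starts in cell 0; entering the rightmost cell ends the
# episode with reward 1, every other step gives reward 0.
N_STATES, N_ACTIONS = 6, 2            # actions: 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.9  # learning rate, decay, greedy prob.
rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))   # the Q table, all zeros to start

def step(state, action):
    """Move left or right; entering the last cell terminates the episode."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def choose_action(state):
    # epsilon-greedy as described above: with probability EPSILON pick the
    # action with the largest Q value, otherwise pick one at random.
    if rng.random() < EPSILON:
        return int(np.argmax(Q[state]))
    return int(rng.integers(N_ACTIONS))

for episode in range(50):
    s, done = 0, False
    while not done:
        a = choose_action(s)
        s2, r, done = step(s, a)
        # real Q = R + gamma * max Q(s2); no future term in a terminal state
        target = r if done else r + GAMMA * np.max(Q[s2])
        Q[s, a] += ALPHA * (target - Q[s, a])  # old Q + alpha * difference
        s = s2

print(Q)  # after training, the Q values favor moving right toward the reward
```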
4.4 The Decay Factor $\gamma$
As the figure above shows, the estimate of $Q(s_1)$ does not depend only on $s_2$: expanding by the same rule, it is related to all the subsequent states $s_3, s_4, \dots$ as well, and all of these can be used to estimate the real Q value.
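Writing $r_t$ for the reward received on entering state $s_t$, that expansion can be reconstructed as:

$$
Q(s_1) = r_2 + \gamma Q(s_2) = r_2 + \gamma \big[ r_3 + \gamma Q(s_3) \big] = r_2 + \gamma r_3 + \gamma^2 r_4 + \gamma^3 r_5 + \cdots
$$

so the further a reward lies in the future, the more heavily it is discounted by powers of $\gamma$.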
When $\gamma = 1$, the future is given full weight: future rewards are considered without any discounting.
When $\gamma \in (0, 1)$, the larger the value, the more attention is paid to the future; you could say the agent becomes more "far-sighted".
When $\gamma = 0$, the future is ignored entirely; only the immediate reward counts.