[Morvan Reinforcement Learning] Video Notes (III) 3. SARSA(lambda)
2022-07-24 09:17:00 【Your sister Xuan】
Section 9 SARSA(lambda)
9.1 A brief introduction to SARSA(lambda)
Through the previous studies we know what SARSA is: an On-Policy (same-policy) algorithm with single-step updates. While moving through the environment, we update the Q table once at every step, so this update scheme can be called SARSA(0).
If we take one more step after that and only then update, it could be called SARSA(1). Taken to the extreme, waiting until the whole episode ends before updating would be SARSA(n) (assuming the episode ends at step n). lambda then expresses how many of the preceding steps we want to update, and the method is written SARSA(lambda). This lambda takes a value between 0 and 1: steps closer to the final reward are updated more strongly, and steps further away more weakly.
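To make the range between 0 and 1 concrete, the usual textbook way to write this blend of n-step updates (my addition, not from the video) is:

$$
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{\,n-1} r_{t+n} + \gamma^{\,n}\, Q(s_{t+n}, a_{t+n}), \qquad
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_t^{(n)}
$$

With lambda = 0 only the one-step target survives (SARSA(0)); as lambda approaches 1 the weight shifts toward the full-episode return, i.e. the round update.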
9.2 Single-step update vs. round update
I was a little confused here; my understanding of the single-step update and the round update is roughly this:
- Single-step update: the Q table is updated at every step, but only the final step is directly tied to the reward; the earlier updates are based only on intermediate steps.
- Round update: the update happens only after all the steps of the episode are finished, but then every step gets updated, and each update is based on the final outcome.
Seen this way, the round update may be more efficient, because every step that gets updated is related to the final result.
9.3 Implementing it in code
In a nutshell, the single-step update SARSA(0) only updates the Q value of the most recent step, while SARSA(1) updates all of the preceding steps. lambda can be understood as a decay value over the footsteps (not a decay of the step size, but of how strongly each step is updated): the closer a step is to the goal reward, the more important it is; the farther away, the less important.
A further understanding only came after writing the code. The SARSA(lambda) pseudo-code is as follows:
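The pseudo-code figure from the video is not reproduced here; as a rough substitute, here is my own sketch of tabular SARSA(lambda) with an accumulating trace, matching the class implemented below (the `env.reset()` / `env.step()` interface is an assumption, the same one used by the maze code later on):

```python
from collections import defaultdict
import random

def choose(Q, s, actions, epsilon):
    # epsilon here is the probability of acting greedily, as in the class below
    if random.random() < epsilon:
        return max(actions, key=lambda a: Q[(s, a)])
    return random.choice(actions)

def sarsa_lambda(env, actions, episodes=10, alpha=0.01, gamma=0.9, lam=0.9, epsilon=0.9):
    Q = defaultdict(float)            # Q(s, a)
    for _ in range(episodes):
        E = defaultdict(float)        # eligibility trace, reset every episode
        s = env.reset()
        a = choose(Q, s, actions, epsilon)
        done = False
        while not done:
            s_, r, done = env.step(a)
            a_ = choose(Q, s_, actions, epsilon)
            delta = r + (0 if done else gamma * Q[(s_, a_)]) - Q[(s, a)]
            E[(s, a)] += 1            # mark the step just taken
            for key in list(E):       # update every visited (s, a), weighted by its trace
                Q[key] += alpha * delta * E[key]
                E[key] *= gamma * lam # traces fade over time
            s, a = s_, a_
    return Q
```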
It may be more intuitive to explain with the program. The program is similar to the SARSA code written before; it could still be written by inheritance, but for convenience I did not inherit. For the earlier code please refer to: [Morvan Reinforcement Learning] Video Notes (III) 2. SARSA learns to walk the maze.
SARSA(lambda) shares the same main-loop code as SARSA. The full code is given at the end; here I only discuss the parts that differ from the SARSA algorithm.
Initialization
The initialization is almost the same as before; two elements are added: the lambda value and the eligibility trace table (a table used to record how important each step is over time).
```python
def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9, trace_decay=0.9):
    self.actions = actions          # action space
    self.lr = learning_rate         # learning rate
    self.gamma = reward_decay       # reward decay
    self.epsilon = e_greedy         # greediness
    self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)  # initial Q table
    # the two extra parameters needed by SARSA(lambda)
    self.lambda_ = trace_decay      # i.e. lambda, the trace decay
    self.eligibility_trace = self.q_table.copy()  # E table, same shape as the Q table, records visit marks
```
Check whether the state exists
There is not much change from the previous version of this function; the main difference is the E table, which also grows dynamically from an empty table to one containing all visited states. At the end you can also print the Q table and the E table to observe the optimization results.
```python
def check_state_exit(self, state):  # input: a state
    if state not in self.q_table.index:  # the state is not yet in the Q table
        # insert a new row, with the Q values initialized to 0
        to_be_append = pd.Series([0] * len(self.actions), index=self.q_table.columns, name=state)
        self.q_table = self.q_table.append(to_be_append)                      # update the Q table
        self.eligibility_trace = self.eligibility_trace.append(to_be_append)  # update the E table
```
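One note from my side: `DataFrame.append` was removed in pandas 2.0, so on a newer pandas this method needs a small change. A sketch of an equivalent using `pd.concat` (same behavior, different call):

```python
def check_state_exit(self, state):
    if state not in self.q_table.index:
        # one new row per unseen state, Q values and trace initialized to 0.0
        new_row = pd.Series([0.0] * len(self.actions), index=self.q_table.columns, name=state)
        self.q_table = pd.concat([self.q_table, new_row.to_frame().T])
        self.eligibility_trace = pd.concat([self.eligibility_trace, new_row.to_frame().T])
```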
The learn function
The difference from SARSA is that the steps need to be recorded in the E table: whenever a state–action pair is experienced, its $E(s,a)$ entry is simply increased by 1, indicating that its importance has grown. After the Q table is updated, the E table is also decayed (by gamma and lambda), so the importance fades over time.
```python
def learn(self, s, a, r, s_, a_):
    self.check_state_exit(s_)  # make sure state s_ exists; s_ is the next state obtained by interacting with the environment
    q_predict = self.q_table.loc[s, a]  # Q value of the current state s and action a
    if s_ != 'terminal':  # if the next state is not terminal
        q_target = r + self.gamma * self.q_table.loc[s_, a_]  # the next action is already sampled, so use Q(s', a') directly
    else:
        q_target = r  # otherwise the target is just the immediate reward
    error = q_target - q_predict  # same as before
    self.eligibility_trace.loc[s, a] += 1  # the step was experienced, so mark it
    self.q_table += self.lr * error * self.eligibility_trace  # update Q(s,a), weighted by importance
    self.eligibility_trace *= self.gamma * self.lambda_  # the E table decays over time
```
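In formulas, what this function does at every step (my transcription of the code above, with alpha being `self.lr`) is:

$$
\delta = r + \gamma\, Q(s', a') - Q(s, a) \quad (\text{or } \delta = r - Q(s, a) \text{ when } s' \text{ is terminal}),
$$
$$
E(s, a) \leftarrow E(s, a) + 1, \qquad
Q \leftarrow Q + \alpha\, \delta\, E \ \ (\text{over the whole table}), \qquad
E \leftarrow \gamma \lambda\, E .
$$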
Here `self.eligibility_trace.loc[s, a] += 1` can also be written in a more effective way, which is equivalent to normalizing the E table (this variant is usually called a replacing trace, while `+= 1` above is an accumulating trace), so that errors such as bias in Q do not keep piling up:

```python
self.eligibility_trace.loc[s, :] *= 0  # clear the whole row for state s
self.eligibility_trace.loc[s, a] = 1   # then set only the chosen action back to 1 (capped instead of accumulated)
```
Main loop
Finally, the main loop needs one extra line, `RL.eligibility_trace *= 0`: before starting a new episode, the E table has to be reset.
9.4 Understanding lambda
As the code above shows, lambda is the decay factor of the E table over time.
When updating the Q table, the line `self.q_table += self.lr * error * self.eligibility_trace` shows that the update is multiplied by the E table, which expresses importance. So entries $Q(s,a)$ that have not been visited recently are essentially not updated. It is a bit like repeatedly trying to walk to an oasis in the desert: the paths that already carry footprints, the ones that led to success, will naturally be the first choice. But there is wind in the desert and footprints fade with it; if a spot is not walked on often, its footprints disappear. That is exactly what `self.eligibility_trace *= self.gamma * self.lambda_` does: every step, the traces decay a little. This feels somewhat similar to the ant colony algorithm, and it can speed up convergence.
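To get a feel for how quickly a footprint fades with the defaults used above (gamma = 0.9, lambda = 0.9), a quick calculation of the factor $(\gamma\lambda)^k$ that remains after $k$ steps without revisiting:

```python
gamma, lam = 0.9, 0.9
for k in (1, 5, 10, 20):
    print(k, round((gamma * lam) ** k, 3))
# 1 0.81
# 5 0.349
# 10 0.122
# 20 0.015
```

So a step taken about twenty moves ago contributes only about 1.5% of what a fresh step contributes to the update.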
Printing the Q table and the E table shows the progress of the algorithm clearly.
9.5 Full code listing
The SARSALambda class, SARSAlambda.py:
```python
import numpy as np
import pandas as pd


class SARSALambda:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9, trace_decay=0.9):
        self.actions = actions          # action space
        self.lr = learning_rate         # learning rate
        self.gamma = reward_decay       # reward decay
        self.epsilon = e_greedy         # greediness
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)  # initial Q table
        # the two extra parameters needed by SARSA(lambda)
        self.lambda_ = trace_decay      # i.e. lambda, the trace decay
        self.eligibility_trace = self.q_table.copy()  # E table, same shape as the Q table, records visit marks

    def check_state_exit(self, state):  # input: a state
        if state not in self.q_table.index:  # the state is not yet in the Q table
            # insert a new row, with the Q values initialized to 0
            to_be_append = pd.Series([0] * len(self.actions), index=self.q_table.columns, name=state)
            self.q_table = self.q_table.append(to_be_append)                      # update the Q table
            self.eligibility_trace = self.eligibility_trace.append(to_be_append)  # update the E table

    def choose_action(self, observation):  # choose an action based on the current state
        self.check_state_exit(observation)  # make sure the state exists; if not, add it to the Q table
        if np.random.uniform() < self.epsilon:  # act greedily: pick the action with the largest Q value
            state_action = self.q_table.loc[observation, :]  # the row for this state
            # several actions may share the maximum Q value, so pick among them at random
            action = np.random.choice(state_action[state_action == np.max(state_action)].index)
        else:
            action = np.random.choice(self.actions)  # otherwise pick a random action
        return action

    def learn(self, s, a, r, s_, a_):
        self.check_state_exit(s_)  # make sure state s_ exists; s_ is the next state obtained from the environment
        q_predict = self.q_table.loc[s, a]  # Q value of the current state s and action a
        if s_ != 'terminal':  # if the next state is not terminal
            q_target = r + self.gamma * self.q_table.loc[s_, a_]  # the next action is already sampled, so use Q(s', a') directly
        else:
            q_target = r  # otherwise the target is just the immediate reward
        error = q_target - q_predict  # same as before
        self.eligibility_trace.loc[s, a] += 1  # the step was experienced, so mark it
        self.q_table += self.lr * error * self.eligibility_trace  # update Q(s,a), weighted by importance
        self.eligibility_trace *= self.gamma * self.lambda_  # the E table decays over time
```
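A quick sanity check of the agent on its own (my example, with made-up state names 's0' and 's1', assuming a pandas version where `DataFrame.append` still exists, or using the `pd.concat` variant shown earlier):

```python
agent = SARSALambda(actions=[0, 1])
a = agent.choose_action('s0')
a_ = agent.choose_action('s1')
agent.learn('s0', a, 1.0, 's1', a_)
print(agent.q_table)            # Q('s0', a) has moved a little toward the reward
print(agent.eligibility_trace)  # the visited entry carries a decayed trace
```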
The main loop, main.py:
```python
from SARSAlambda import SARSALambda
from maze_env import Maze


def update():  # the main update loop
    for episode in range(10):  # number of episodes to play
        observation = env.reset()  # initialize the environment
        action = RL.choose_action(str(observation))
        RL.eligibility_trace *= 0  # specific to SARSA(lambda): reset the E table
        while True:
            env.render()  # refresh the image
            observation_, reward, done = env.step(action)  # interact with the environment: next state, reward, and whether it is terminal
            action_ = RL.choose_action(str(observation_))  # get the next action a' directly via epsilon-greedy
            RL.learn(str(observation), action, reward, str(observation_), action_)  # update the Q table
            observation = observation_  # move to the next state
            action = action_  # the chosen action becomes the current action
            print('Q table:')
            print(RL.q_table)
            print('E table:')
            print(RL.eligibility_trace)
            if done:
                break
    print('Game Over')  # game over
    env.destroy()  # close the window


if __name__ == '__main__':
    env = Maze()  # create the environment
    RL = SARSALambda(actions=list(range(env.n_actions)))  # create the agent
    env.after(100, update)  # call update() after 100 ms
    env.mainloop()  # start the environment visualization
```
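The `Maze` environment comes from the previous post (`maze_env.py` in the maze-walking example). If you just want to exercise the loop without the tkinter window, a hypothetical stand-in with the same interface could look like this (my sketch of a 1-D corridor; the real `Maze` is a grid world with a GUI):

```python
# fake_env.py -- a hypothetical stand-in for Maze, only for a quick dry run
class FakeMaze:
    n_actions = 2                       # 0 = left, 1 = right

    def reset(self):
        self.pos = 0
        return self.pos                 # the observation is just the position

    def step(self, action):
        self.pos += 1 if action == 1 else -1
        if self.pos >= 3:               # reaching the right end is the goal
            return 'terminal', 1, True
        if self.pos <= -3:              # falling off the left end ends the episode
            return 'terminal', -1, True
        return self.pos, 0, False

    def render(self):
        pass                            # no GUI to refresh

    def after(self, ms, func):
        func()                          # call the main loop immediately instead of scheduling it

    def mainloop(self):
        pass

    def destroy(self):
        pass
```

Replacing `from maze_env import Maze` with `from fake_env import FakeMaze as Maze` should then let main.py run to completion in the console.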
Previous: [Morvan Reinforcement Learning] Video Notes (III) 2. SARSA learns to walk the maze
Next: [Morvan Reinforcement Learning] Video Notes (IV) 1. What is DQN?