[Morvan Reinforcement Learning] Video Notes (III) 3. SARSA(lambda)
2022-07-24 09:17:00 【Your sister Xuan】
Section 9 SARSA(lambda)
9.1 A brief introduction to SARSA(lambda)
Through the previous studies we know what SARSA is: an On-Policy (same-policy) algorithm with single-step updates. While moving through the environment, we update the Q table once at every step, so this update scheme can be called SARSA(0).
If we take one more step after that and only then update, it could be called SARSA(1). Taken to the extreme, waiting until the whole episode ends before updating would be SARSA(n) (assuming the episode ends at step n). lambda then expresses how many of the preceding steps we want to update, and the method is written SARSA(lambda). This lambda takes a value between 0 and 1: steps closer to the final reward are updated more strongly, and steps further away more weakly.
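To make the range between 0 and 1 concrete, the usual textbook way to write this blend of n-step updates (my addition, not from the video) is:

$$
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{\,n-1} r_{t+n} + \gamma^{\,n}\, Q(s_{t+n}, a_{t+n}), \qquad
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_t^{(n)}
$$

With lambda = 0 only the one-step target survives (SARSA(0)); as lambda approaches 1 the weight shifts toward the full-episode return, i.e. the round update.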
9.2 Single-step update vs. round update
I was a little confused here; my understanding of the single-step update and the round update is roughly this:
- Single-step update: the Q table is updated at every step, but only the final step is directly tied to the reward; the earlier updates are based only on intermediate steps.
- Round update: the update happens only after all the steps of the episode are finished, but then every step gets updated, and each update is based on the final outcome.
Seen this way, the round update may be more efficient, because every step that gets updated is related to the final result.
9.3 Implementing it in code
In a nutshell, the single-step update SARSA(0) only updates the Q value of the most recent step, while SARSA(1) updates all of the preceding steps. lambda can be understood as a decay value over the footsteps (not a decay of the step size, but of how strongly each step is updated): the closer a step is to the goal reward, the more important it is; the farther away, the less important.
A further understanding only came after writing the code. The SARSA(lambda) pseudo-code is as follows:
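The pseudo-code figure from the video is not reproduced here; as a rough substitute, here is my own sketch of tabular SARSA(lambda) with an accumulating trace, matching the class implemented below (the `env.reset()` / `env.step()` interface is an assumption, the same one used by the maze code later on):

```python
from collections import defaultdict
import random

def choose(Q, s, actions, epsilon):
    # epsilon here is the probability of acting greedily, as in the class below
    if random.random() < epsilon:
        return max(actions, key=lambda a: Q[(s, a)])
    return random.choice(actions)

def sarsa_lambda(env, actions, episodes=10, alpha=0.01, gamma=0.9, lam=0.9, epsilon=0.9):
    Q = defaultdict(float)            # Q(s, a)
    for _ in range(episodes):
        E = defaultdict(float)        # eligibility trace, reset every episode
        s = env.reset()
        a = choose(Q, s, actions, epsilon)
        done = False
        while not done:
            s_, r, done = env.step(a)
            a_ = choose(Q, s_, actions, epsilon)
            delta = r + (0 if done else gamma * Q[(s_, a_)]) - Q[(s, a)]
            E[(s, a)] += 1            # mark the step just taken
            for key in list(E):       # update every visited (s, a), weighted by its trace
                Q[key] += alpha * delta * E[key]
                E[key] *= gamma * lam # traces fade over time
            s, a = s_, a_
    return Q
```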
It may be more intuitive to explain with the program. The program is similar to the SARSA code written before; it could still be written by inheritance, but for convenience I did not inherit. For the earlier code please refer to: [Morvan Reinforcement Learning] Video Notes (III) 2. SARSA learns to walk the maze.
SARSA(lambda) shares the same main-loop code as SARSA. The full code is given at the end; here I only discuss the parts that differ from the SARSA algorithm.
Initialization
The initialization is almost the same as before; two elements are added: the lambda value and the eligibility trace table (a table used to record how important each step is over time).
```python
def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9, trace_decay=0.9):
    self.actions = actions          # action space
    self.lr = learning_rate         # learning rate
    self.gamma = reward_decay       # reward decay
    self.epsilon = e_greedy         # greediness
    self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)  # initial Q table
    # the two extra parameters needed by SARSA(lambda)
    self.lambda_ = trace_decay      # i.e. lambda, the trace decay
    self.eligibility_trace = self.q_table.copy()  # E table, same shape as the Q table, records visit marks
```
Check whether the state exists
There is not much change from the previous version of this function; the main difference is the E table, which also grows dynamically from an empty table to one containing all visited states. At the end you can also print the Q table and the E table to observe the optimization results.
```python
def check_state_exit(self, state):  # input: a state
    if state not in self.q_table.index:  # the state is not yet in the Q table
        # insert a new row, with the Q values initialized to 0
        to_be_append = pd.Series([0] * len(self.actions), index=self.q_table.columns, name=state)
        self.q_table = self.q_table.append(to_be_append)                      # update the Q table
        self.eligibility_trace = self.eligibility_trace.append(to_be_append)  # update the E table
```
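One note from my side: `DataFrame.append` was removed in pandas 2.0, so on a newer pandas this method needs a small change. A sketch of an equivalent using `pd.concat` (same behavior, different call):

```python
def check_state_exit(self, state):
    if state not in self.q_table.index:
        # one new row per unseen state, Q values and trace initialized to 0.0
        new_row = pd.Series([0.0] * len(self.actions), index=self.q_table.columns, name=state)
        self.q_table = pd.concat([self.q_table, new_row.to_frame().T])
        self.eligibility_trace = pd.concat([self.eligibility_trace, new_row.to_frame().T])
```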
The learn function
The difference from SARSA is that the steps need to be recorded in the E table: whenever a state–action pair is experienced, its $E(s,a)$ entry is simply increased by 1, indicating that its importance has grown. After the Q table is updated, the E table is also decayed (by gamma and lambda), so the importance fades over time.
```python
def learn(self, s, a, r, s_, a_):
    self.check_state_exit(s_)  # make sure state s_ exists; s_ is the next state obtained by interacting with the environment
    q_predict = self.q_table.loc[s, a]  # Q value of the current state s and action a
    if s_ != 'terminal':  # if the next state is not terminal
        q_target = r + self.gamma * self.q_table.loc[s_, a_]  # the next action is already sampled, so use Q(s', a') directly
    else:
        q_target = r  # otherwise the target is just the immediate reward
    error = q_target - q_predict  # same as before
    self.eligibility_trace.loc[s, a] += 1  # the step was experienced, so mark it
    self.q_table += self.lr * error * self.eligibility_trace  # update Q(s,a), weighted by importance
    self.eligibility_trace *= self.gamma * self.lambda_  # the E table decays over time
```
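In formulas, what this function does at every step (my transcription of the code above, with alpha being `self.lr`) is:

$$
\delta = r + \gamma\, Q(s', a') - Q(s, a) \quad (\text{or } \delta = r - Q(s, a) \text{ when } s' \text{ is terminal}),
$$
$$
E(s, a) \leftarrow E(s, a) + 1, \qquad
Q \leftarrow Q + \alpha\, \delta\, E \ \ (\text{over the whole table}), \qquad
E \leftarrow \gamma \lambda\, E .
$$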
Here `self.eligibility_trace.loc[s, a] += 1` can also be written in a more effective way, which is equivalent to normalizing the E table (this variant is usually called a replacing trace, while `+= 1` above is an accumulating trace), so that errors such as bias in Q do not keep piling up:

```python
self.eligibility_trace.loc[s, :] *= 0  # clear the whole row for state s
self.eligibility_trace.loc[s, a] = 1   # then set only the chosen action back to 1 (capped instead of accumulated)
```
Main loop
Finally, the main loop needs one extra line, `RL.eligibility_trace *= 0`: before starting a new episode, the E table has to be reset.
9.4 Understanding lambda
As the code above shows, lambda is the decay factor of the E table over time.
When updating the Q table, the line `self.q_table += self.lr * error * self.eligibility_trace` shows that the update is multiplied by the E table, which expresses importance. So entries $Q(s,a)$ that have not been visited recently are essentially not updated. It is a bit like repeatedly trying to walk to an oasis in the desert: the paths that already carry footprints, the ones that led to success, will naturally be the first choice. But there is wind in the desert and footprints fade with it; if a spot is not walked on often, its footprints disappear. That is exactly what `self.eligibility_trace *= self.gamma * self.lambda_` does: every step, the traces decay a little. This feels somewhat similar to the ant colony algorithm, and it can speed up convergence.
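To get a feel for how quickly a footprint fades with the defaults used above (gamma = 0.9, lambda = 0.9), a quick calculation of the factor $(\gamma\lambda)^k$ that remains after $k$ steps without revisiting:

```python
gamma, lam = 0.9, 0.9
for k in (1, 5, 10, 20):
    print(k, round((gamma * lam) ** k, 3))
# 1 0.81
# 5 0.349
# 10 0.122
# 20 0.015
```

So a step taken about twenty moves ago contributes only about 1.5% of what a fresh step contributes to the update.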
Printing the Q table and the E table shows the progress of the algorithm clearly.
9.5 Full code listing
The SARSALambda class, SARSAlambda.py:
```python
import numpy as np
import pandas as pd


class SARSALambda:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9, trace_decay=0.9):
        self.actions = actions          # action space
        self.lr = learning_rate         # learning rate
        self.gamma = reward_decay       # reward decay
        self.epsilon = e_greedy         # greediness
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)  # initial Q table
        # the two extra parameters needed by SARSA(lambda)
        self.lambda_ = trace_decay      # i.e. lambda, the trace decay
        self.eligibility_trace = self.q_table.copy()  # E table, same shape as the Q table, records visit marks

    def check_state_exit(self, state):  # input: a state
        if state not in self.q_table.index:  # the state is not yet in the Q table
            # insert a new row, with the Q values initialized to 0
            to_be_append = pd.Series([0] * len(self.actions), index=self.q_table.columns, name=state)
            self.q_table = self.q_table.append(to_be_append)                      # update the Q table
            self.eligibility_trace = self.eligibility_trace.append(to_be_append)  # update the E table

    def choose_action(self, observation):  # choose an action based on the current state
        self.check_state_exit(observation)  # make sure the state exists; if not, add it to the Q table
        if np.random.uniform() < self.epsilon:  # act greedily: pick the action with the largest Q value
            state_action = self.q_table.loc[observation, :]  # the row for this state
            # several actions may share the maximum Q value, so pick among them at random
            action = np.random.choice(state_action[state_action == np.max(state_action)].index)
        else:
            action = np.random.choice(self.actions)  # otherwise pick a random action
        return action

    def learn(self, s, a, r, s_, a_):
        self.check_state_exit(s_)  # make sure state s_ exists; s_ is the next state obtained from the environment
        q_predict = self.q_table.loc[s, a]  # Q value of the current state s and action a
        if s_ != 'terminal':  # if the next state is not terminal
            q_target = r + self.gamma * self.q_table.loc[s_, a_]  # the next action is already sampled, so use Q(s', a') directly
        else:
            q_target = r  # otherwise the target is just the immediate reward
        error = q_target - q_predict  # same as before
        self.eligibility_trace.loc[s, a] += 1  # the step was experienced, so mark it
        self.q_table += self.lr * error * self.eligibility_trace  # update Q(s,a), weighted by importance
        self.eligibility_trace *= self.gamma * self.lambda_  # the E table decays over time
```
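A quick sanity check of the agent on its own (my example, with made-up state names 's0' and 's1', assuming a pandas version where `DataFrame.append` still exists, or using the `pd.concat` variant shown earlier):

```python
agent = SARSALambda(actions=[0, 1])
a = agent.choose_action('s0')
a_ = agent.choose_action('s1')
agent.learn('s0', a, 1.0, 's1', a_)
print(agent.q_table)            # Q('s0', a) has moved a little toward the reward
print(agent.eligibility_trace)  # the visited entry carries a decayed trace
```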
The main loop, main.py:
```python
from SARSAlambda import SARSALambda
from maze_env import Maze


def update():  # the main update loop
    for episode in range(10):  # number of episodes to play
        observation = env.reset()  # initialize the environment
        action = RL.choose_action(str(observation))
        RL.eligibility_trace *= 0  # specific to SARSA(lambda): reset the E table
        while True:
            env.render()  # refresh the image
            observation_, reward, done = env.step(action)  # interact with the environment: next state, reward, and whether it is terminal
            action_ = RL.choose_action(str(observation_))  # get the next action a' directly via epsilon-greedy
            RL.learn(str(observation), action, reward, str(observation_), action_)  # update the Q table
            observation = observation_  # move to the next state
            action = action_  # the chosen action becomes the current action
            print('Q table:')
            print(RL.q_table)
            print('E table:')
            print(RL.eligibility_trace)
            if done:
                break
    print('Game Over')  # game over
    env.destroy()  # close the window


if __name__ == '__main__':
    env = Maze()  # create the environment
    RL = SARSALambda(actions=list(range(env.n_actions)))  # create the agent
    env.after(100, update)  # call update() after 100 ms
    env.mainloop()  # start the environment visualization
```
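The `Maze` environment comes from the previous post (`maze_env.py` in the maze-walking example). If you just want to exercise the loop without the tkinter window, a hypothetical stand-in with the same interface could look like this (my sketch of a 1-D corridor; the real `Maze` is a grid world with a GUI):

```python
# fake_env.py -- a hypothetical stand-in for Maze, only for a quick dry run
class FakeMaze:
    n_actions = 2                       # 0 = left, 1 = right

    def reset(self):
        self.pos = 0
        return self.pos                 # the observation is just the position

    def step(self, action):
        self.pos += 1 if action == 1 else -1
        if self.pos >= 3:               # reaching the right end is the goal
            return 'terminal', 1, True
        if self.pos <= -3:              # falling off the left end ends the episode
            return 'terminal', -1, True
        return self.pos, 0, False

    def render(self):
        pass                            # no GUI to refresh

    def after(self, ms, func):
        func()                          # call the main loop immediately instead of scheduling it

    def mainloop(self):
        pass

    def destroy(self):
        pass
```

Replacing `from maze_env import Maze` with `from fake_env import FakeMaze as Maze` should then let main.py run to completion in the console.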
Previous: [Morvan Reinforcement Learning] Video Notes (III) 2. SARSA learns to walk the maze
Next: [Morvan Reinforcement Learning] Video Notes (IV) 1. What is DQN?