Reinforcement learning series (I) -- basic concepts
2022-06-23 17:51:00 【languageX】
I have been learning about reinforcement learning recently and plan to organize and summarize it. This article first introduces some basic concepts in reinforcement learning.
Reinforcement learning
Reinforcement learning, supervised learning, and unsupervised learning are the three main learning paradigms of machine learning.
In reinforcement learning, an agent learns by continually exploring the unknown and exploiting the knowledge it already has about the environment, and then follows the learned policy (which action to take in which state) to perform a sequence of actions that maximizes the cumulative reward in the long run.
Differences among supervised learning, unsupervised learning, and reinforcement learning
Supervised learning requires training data with both inputs and labels, and it learns the expected output for an input from the labels. Reinforcement learning has no label values, only rewards and penalties; it must keep interacting with the environment and learn the optimal policy by trial and error.
Unsupervised learning has neither labels nor reward values; it learns the hidden structure of the data and is typically used for clustering. Reinforcement learning, in contrast, requires feedback from the environment.
Samples in supervised and unsupervised learning have no sequential dependence, whereas the reward computation in reinforcement learning is sequence-dependent: the return is delayed.
Markov decision process (MDP)
Let's first look at the MDP, the theoretical foundation of reinforcement learning. It helps us understand how the agent makes decisions in reinforcement learning and gives a clearer picture of concepts such as the value function.
An MDP consists of a four-tuple M=(S,A,P,R)
S: the state set, where s_i denotes the state at time step i, s \in S
A: the action set, where a_i denotes the action at time step i, a \in A
P: the transition probabilities. P_{sa} denotes the probability of transitioning to each other state after executing action a in state s.
R: the reward function, i.e., the reward obtained by executing action a in state s. Note that this reward is immediate. The mapping from states S to actions A is the behavior policy, written \pi: S \rightarrow A.
The dynamic process of an MDP unfolds as follows: the agent starts in state s_1, takes action a_1, enters state s_2, and receives reward r_1. From state s_2 it then takes action a_2, enters state s_3, receives reward r_2, and so on.
An MDP has the Markov property, i.e., no after-effect: the next state depends only on the current state. Since an MDP also includes actions, it assumes that the probability of transitioning from the current state to the next state depends only on the current state and the action taken, and not on any earlier states.
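To make the four-tuple concrete, here is a minimal sketch of how an MDP M=(S,A,P,R) could be represented and sampled in Python. All states, actions, probabilities, and rewards below are made up for illustration, not from any particular problem.

```python
import random

# A toy MDP M = (S, A, P, R); all numbers are made up for illustration.
# P[(s, a)] maps each next state s' to its transition probability P_sa(s').
# R[(s, a)] is the immediate reward for taking action a in state s.
S = ["s1", "s2", "s3"]
A = ["a1", "a2"]
P = {
    ("s1", "a1"): {"s2": 0.8, "s1": 0.2},
    ("s1", "a2"): {"s3": 1.0},
    ("s2", "a1"): {"s3": 0.6, "s1": 0.4},
    ("s2", "a2"): {"s2": 1.0},
    ("s3", "a1"): {"s3": 1.0},
    ("s3", "a2"): {"s3": 1.0},
}
R = {
    ("s1", "a1"): 0.0, ("s1", "a2"): 1.0,
    ("s2", "a1"): 2.0, ("s2", "a2"): 0.0,
    ("s3", "a1"): 0.0, ("s3", "a2"): 0.0,
}

def step(s, a):
    """Sample the next state and immediate reward using only (s, a) -- the Markov property."""
    next_states = list(P[(s, a)].keys())
    probs = list(P[(s, a)].values())
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R[(s, a)]

# One step of the dynamic process described above: from s1, take a1, land in some s', get r.
s_next, r = step("s1", "a1")
print(s_next, r)
```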
Value function (value function)
We mentioned above that the reward function in an MDP is an immediate reward, but in reinforcement learning what we care about is not the immediate reward of the current step but the long-term return of a state under the policy.
Setting actions aside for a moment, we define a value function (value function) to represent the long-term effect of following policy π from the current state.
Define G_t as the sum of discounted future returns starting from time t: G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
where γ ∈ [0,1] is called the discount factor and indicates how important future rewards are relative to the current reward. When γ=0, only the immediate reward is considered and long-term returns are ignored; when γ=1, long-term returns are as important as immediate rewards.
The value function V(s) is defined as the long-term value of state s, i.e., the mathematical expectation of G_t: V(s) = \mathbb{E}[G_t \mid S_t = s]
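As a small sketch of the definition of G_t above, the discounted return can be computed from a finite trajectory of rewards like this (the reward numbers are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... over a finite trajectory."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: rewards collected after time t (made-up numbers).
rewards = [1.0, 0.0, 2.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # 1*1 + 0.9*0 + 0.81*2 + 0.729*1
```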
Now bring the policy and actions of the MDP back in. Given a policy π and an initial state s, the action is a = π(s), and we can define:
State value function: v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]
Action value function (given the current state s and the current action a, and following policy π thereafter): q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]
In q_\pi(s,a), not only are the policy π and the initial state s given, but the current action a is given as well; this is the main difference between q_\pi(s,a) and v_\pi(s).
The optimal policy of an MDP corresponds to the optimal state-value function: v_*(s) = \underset{\pi}{\max} v_{\pi}(s), i.e., in state s we choose the policy that maximizes the value.
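As a small, hedged illustration of "choosing what maximizes value": one common way to read a policy off an action value function is to act greedily with respect to it. The q table below is entirely made up; this is only a sketch, not the full optimality machinery.

```python
# Made-up action values q(s, a), for illustration only.
q = {
    ("s1", "a1"): 0.5, ("s1", "a2"): 1.2,
    ("s2", "a1"): 2.0, ("s2", "a2"): 0.3,
}

def greedy_policy(q, state, actions):
    """pi(s) = argmax_a q(s, a): pick the action with the largest action value in this state."""
    return max(actions, key=lambda a: q[(state, a)])

print(greedy_policy(q, "s1", ["a1", "a2"]))  # a2
```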
Solving for the optimal policy
As mentioned above, our goal is to find a policy that maximizes the value function, which is an optimization problem.
The main methods are:
- Dynamic programming (dynamic programming methods)
The dynamic programming (DP) method computes the value function as: v_{k+1}(s) = \sum_{a} \pi(a|s) \sum_{s',r} p(s',r|s,a)\,[r + \gamma v_k(s')]
As the formula shows, the value function computed by DP uses the value functions of all successor states s' of the current state s, i.e., the bootstrapping method (the current value function is estimated from the value functions of subsequent states). The successor states are obtained from the model p(s',r|S_t,a).
DP therefore maintains a Q table that stores the value of each state and derives the optimal policy from the Q table by iteration. With the Q table stored, the current value function is estimated from the value functions of subsequent states, so single-step updates are possible, which improves learning efficiency.
Advantage: once the Q table is obtained in advance, no further interaction with the environment is needed; the optimal policy is obtained directly by the iterative algorithm.
Disadvantage: in practical applications, the state transition probabilities and value functions are difficult to obtain.
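A minimal sketch of DP-style iterative policy evaluation under the assumptions above, i.e., a known model p(s', r | s, a) and a fixed policy; the toy model, policy, and numbers are made up for illustration. Each sweep re-estimates v(s) from the value functions of successor states (bootstrapping).

```python
# Toy model for illustration: p[(s, a)] = list of (prob, next_state, reward) triples.
p = {
    ("s1", "a1"): [(1.0, "s2", 0.0)],
    ("s1", "a2"): [(1.0, "s3", 1.0)],
    ("s2", "a1"): [(0.5, "s3", 2.0), (0.5, "s1", 0.0)],
    ("s2", "a2"): [(1.0, "s2", 0.0)],
    ("s3", "a1"): [(1.0, "s3", 0.0)],
    ("s3", "a2"): [(1.0, "s3", 0.0)],
}
policy = {"s1": "a1", "s2": "a1", "s3": "a1"}  # a fixed deterministic policy pi(s)
gamma = 0.9
V = {s: 0.0 for s in ["s1", "s2", "s3"]}

# Iterative policy evaluation: v(s) <- sum over (s', r) of p(s', r | s, pi(s)) * (r + gamma * v(s')).
for sweep in range(100):
    for s in V:
        a = policy[s]
        V[s] = sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in p[(s, a)])

print(V)
```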
- The Monte Carlo method (Monte Carlo Methods)
The Monte Carlo method (MC) updates the state value function as: V(S_t) \leftarrow V(S_t) + \alpha\,(G_t - V(S_t))
where G_t is the cumulative return obtained after each exploration (episode) and \alpha is the learning rate. The deviation between the cumulative return G_t of S_t and the current estimate V(S_t), multiplied by the learning rate, is used to update V(S_t) to a new estimate. Because there is no model from which to obtain state transition probabilities, MC estimates the state value function from empirical means.
The Monte Carlo method estimates the expected value function of a state by sampling (trial and error, interacting with the environment). After sampling, the environment provides the reward information, which is reflected in the value function.
Advantage: it does not need to know the state transition probabilities; it evaluates the expected value function through experience (sampling and trial-and-error experiments). As long as there are enough samples and every possible state-action pair can be sampled, the expectation can be approximated to the greatest extent.
Disadvantage: because the value function is the reward accumulated from state s all the way to the terminal state, every experiment must run to the terminal state before the value function of state s can be obtained, so learning efficiency is very low.
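A minimal sketch of the MC update above: after a complete episode (the episode here is just a made-up list of (state, reward) pairs for illustration), walk backwards to compute G_t for each visited state and nudge V(S_t) toward it with learning rate alpha.

```python
# One made-up episode: (state, reward received after leaving that state).
episode = [("s1", 0.0), ("s2", 2.0), ("s1", 1.0), ("s3", 0.0)]
gamma, alpha = 0.9, 0.1
V = {"s1": 0.0, "s2": 0.0, "s3": 0.0}

# Walk backwards so G_t can be accumulated incrementally: G_t = r_{t+1} + gamma * G_{t+1}.
G = 0.0
for s, r in reversed(episode):
    G = r + gamma * G
    # MC update: V(S_t) <- V(S_t) + alpha * (G_t - V(S_t))
    V[s] += alpha * (G - V[s])

print(V)
```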
- Temporal difference methods (temporal difference)
Finally, the value function update formula of the temporal difference (TD) algorithm is: V(S_t) \leftarrow V(S_t) + \alpha\,(r_{t+1} + \gamma V(S_{t+1}) - V(S_t))
Here r_{t+1} + \gamma V(S_{t+1}) corresponds to G_t in MC. TD does not require a state transition model, nor does it need to run to the terminal state to obtain the cumulative return; instead it uses the immediate reward and the value function at the next time step, similar to the bootstrapping method of DP.
The temporal difference method combines the idea of dynamic programming with the idea of Monte Carlo sampling. Like the Monte Carlo method, it learns directly from experience, estimating the state value function by sampling interactions with the environment and thereby avoiding the dependence on state transition probabilities. Like dynamic programming, it allows the current value function to be estimated from the value functions of subsequent states before the terminal state is reached, enabling one-step updates and improving efficiency.
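A minimal sketch of the TD(0) update above, applied to a single made-up transition (S_t, r_{t+1}, S_{t+1}): it needs neither a transition model nor a complete episode, only the immediate reward and the value estimate of the next state.

```python
gamma, alpha = 0.9, 0.1
V = {"s1": 0.0, "s2": 1.0}

def td0_update(V, s, r_next, s_next):
    """V(S_t) <- V(S_t) + alpha * (r_{t+1} + gamma * V(S_{t+1}) - V(S_t))"""
    td_target = r_next + gamma * V[s_next]  # plays the role of G_t in MC
    V[s] += alpha * (td_target - V[s])

# One made-up transition: from s1, receive reward 0.5, land in s2.
td0_update(V, "s1", 0.5, "s2")
print(V["s1"])  # 0.1 * (0.5 + 0.9 * 1.0 - 0.0) = 0.14
```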
Elements of reinforcement learning
With the basics above in place, let's return to the reinforcement learning framework:
agent: the decision maker
environment: everything the agent interacts with
state: the state the agent is currently in within the environment
action: what the agent performs at each step; it is determined by the current state, the reward from the previous step, and the policy
reward: the reward returned by the environment for an action
policy: how the agent acts, i.e., the mapping from state to action
What we need to do in a reinforcement learning project is: for the problem at hand, first abstract an environment and the agent that interacts with it, extract the states (S), the actions (A), and the immediate reward (R) for performing an action from the environment, and then have the agent learn an optimal policy under which it performs the optimal actions in the environment and the cumulative reward is maximized.
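Putting these elements together, here is a hypothetical skeleton of the agent-environment interaction loop. The `Env` and `Agent` classes are placeholders invented for this sketch (not any real library's API), with a random policy standing in for a learned one.

```python
import random

class Env:
    """Placeholder environment: returns made-up states and rewards."""
    def reset(self):
        return "s1"  # initial state
    def step(self, action):
        next_state = random.choice(["s1", "s2", "s3"])
        reward = 1.0 if next_state == "s3" else 0.0
        done = next_state == "s3"
        return next_state, reward, done  # (state, reward, episode finished?)

class Agent:
    """Placeholder agent: a random policy standing in for a learned one."""
    def act(self, state):
        return random.choice(["a1", "a2"])  # policy: state -> action
    def learn(self, state, action, reward, next_state):
        pass  # update the value function / policy here

env, agent = Env(), Agent()
state = env.reset()
done = False
while not done:
    action = agent.act(state)                    # agent chooses an action from its policy
    next_state, reward, done = env.step(action)  # environment returns reward and next state
    agent.learn(state, action, reward, next_state)
    state = next_state
```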
If anything above is mistaken, corrections are welcome~