Reinforcement Learning: Key Points for Understanding Policy Gradients
2022-07-23 14:10:00 【小陳phd】
Policy Gradient Formula
The goal is to maximize the reward function, i.e., to adjust $\theta$ so that the expected return is maximized. This can be written as:
$$J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r(s_t, a_t)\right]$$
In the expression above, $\tau$ denotes one complete trajectory from start to finish. For a maximization problem, we can use gradient ascent to find the maximum:
$$\theta^{*} = \theta + \alpha \nabla_{\theta} J(\theta)$$
So all we need to compute is $\nabla_{\theta} J(\theta)$, the gradient of the return $J(\theta)$ with respect to $\theta$, which is exactly the policy gradient. It is computed as follows:
$$\begin{aligned} \nabla_{\theta} J(\theta) &= \int \nabla_{\theta} p_{\theta}(\tau)\, r(\tau)\, \mathrm{d}\tau \\ &= \int p_{\theta}(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau)\, r(\tau)\, \mathrm{d}\tau \\ &= \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\nabla_{\theta} \log p_{\theta}(\tau)\, r(\tau)\right] \end{aligned}$$
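The second line above uses the log-derivative trick. Since this step is what makes the gradient estimable from samples, the identity is worth spelling out (added here for clarity; it follows directly from the chain rule):

$$\nabla_{\theta} \log p_{\theta}(\tau) = \frac{\nabla_{\theta} p_{\theta}(\tau)}{p_{\theta}(\tau)} \;\Longrightarrow\; \nabla_{\theta} p_{\theta}(\tau) = p_{\theta}(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau)$$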
Next we expand this expression further. For $p_{\theta}(\tau)$, the probability of a trajectory under policy parameters $\theta$:
$$p_{\theta}(\tau) = p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
Taking the logarithm gives:
$$\log p_{\theta}(\tau) = \log p(s_1) + \sum_{t=1}^{T} \log \left[\pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\right]$$
Differentiating with respect to $\theta$, the initial-state and transition terms do not depend on $\theta$ and drop out:
$$\nabla_{\theta} \log p_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)$$
Substituting this back into the expression for $\nabla_{\theta} J(\theta)$ above, and approximating the expectation with $N$ sampled trajectories, we get:
$$\begin{aligned} \nabla_{\theta} J(\theta) &= \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\nabla_{\theta} \log p_{\theta}(\tau)\, r(\tau)\right] \\ &= \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right] \\ &\approx \frac{1}{N} \sum_{i=1}^{N}\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)\right] \end{aligned}$$
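As a minimal implementation sketch of this Monte Carlo estimator (not from the original post): it assumes a hypothetical PyTorch policy network `policy_net` that maps states to action logits, and trajectories already collected by running the current policy.

```python
import torch

def reinforce_loss(policy_net, trajectories):
    """Surrogate loss whose gradient is the sampled policy gradient (negated for minimization).

    trajectories: list of (states, actions, rewards) tuples for N sampled episodes,
    where states is a float tensor of shape [T, obs_dim], actions an int tensor of
    shape [T], and rewards a float tensor of shape [T].
    """
    per_episode = []
    for states, actions, rewards in trajectories:
        logits = policy_net(states)                          # [T, n_actions]
        dist = torch.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(actions)                   # [T], log pi(a_t | s_t)
        episode_return = rewards.sum()                        # sum_t r(s_t, a_t) = r(tau)
        # (sum_t grad log pi(a_t|s_t)) * r(tau), negated so that minimizing ascends J(theta)
        per_episode.append(-(log_probs.sum() * episode_return))
    return torch.stack(per_episode).mean()                    # 1/N average over episodes

# Usage sketch:
#   loss = reinforce_loss(policy_net, trajectories)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```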
Importance Sampling
Importance sampling is a method for estimating an expectation under one distribution using samples drawn from a different distribution, by reweighting each sample; it can be viewed as a correction of the expectation. The formula is:
$$\begin{aligned} \mathbb{E}_{x \sim p}[f(x)] &= \int f(x)\, p(x)\, \mathrm{d}x \\ &= \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, \mathrm{d}x \\ &= \mathbb{E}_{x \sim q}\left[f(x)\, \frac{p(x)}{q(x)}\right] \end{aligned}$$
Given a known distribution $q$ that we can sample from, the formula above lets us compute the expectation of $f(x)$ under $p$: we sample from $q$ instead of $p$ and weight each sample by the ratio $p(x)/q(x)$. This is importance sampling.
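A small numerical sketch of this reweighting (the choices of $p$, $q$, and $f$ below are purely illustrative assumptions, not from the original post):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative choices: target p = N(0, 1), proposal q = N(1, 2), f(x) = x^2.
f = lambda x: x ** 2
p_pdf = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
q_pdf = lambda x: norm.pdf(x, loc=1.0, scale=2.0)

# Sample from q, then reweight each sample by p(x)/q(x) to estimate E_{x~p}[f(x)].
x_q = rng.normal(loc=1.0, scale=2.0, size=100_000)
weights = p_pdf(x_q) / q_pdf(x_q)
is_estimate = np.mean(f(x_q) * weights)

# Direct Monte Carlo estimate using samples from p, for comparison.
x_p = rng.normal(loc=0.0, scale=1.0, size=100_000)
mc_estimate = np.mean(f(x_p))

print(is_estimate, mc_estimate)   # both approach E_{x~N(0,1)}[x^2] = 1
```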