Deriving the Bellman Equation

Consider the following trajectory:

$$
S_t\xrightarrow{A_t}R_{t+1},\ S_{t+1}\xrightarrow{A_{t+1}}R_{t+2},\ S_{t+2}\xrightarrow{A_{t+2}}R_{t+3},\ \cdots
$$

The return is:

$$
\begin{aligned}
G_t &= R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\cdots\\
&= R_{t+1}+\gamma(R_{t+2}+\gamma R_{t+3}+\cdots)\\
&= R_{t+1}+\gamma G_{t+1}
\end{aligned}
$$
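As a quick numerical sanity check of this recursion, here is a minimal Python sketch; the reward sequence and the value of $\gamma$ are invented purely for illustration:

```python
# Check G_t = R_{t+1} + gamma * G_{t+1} on a made-up finite reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, -2.0, 3.0]  # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4} (hypothetical)

def discounted_return(rs, gamma):
    """Direct definition: G_t = sum_k gamma^k * R_{t+1+k}."""
    return sum(gamma ** k * r for k, r in enumerate(rs))

g_direct = discounted_return(rewards, gamma)
g_recursive = rewards[0] + gamma * discounted_return(rewards[1:], gamma)
assert abs(g_direct - g_recursive) < 1e-12
print(g_direct, g_recursive)
```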

Then the state value can be written as:

$$
\begin{aligned}
v^\pi(s) &= \mathbb{E}^\pi \left[G_t\mid S_t = s \right]\\
&= \mathbb{E}^\pi \left[R_{t+1}+\gamma G_{t+1}\mid S_t = s \right]\\
&= \mathbb{E}^\pi \left[R_{t+1}\mid S_t = s \right]+\gamma\mathbb{E}^\pi \left[G_{t+1}\mid S_t = s \right]
\end{aligned}
$$

First, consider $\mathbb{E}^\pi \left[R_{t+1}\mid S_t = s \right]$:

$$
\begin{aligned}
\mathbb{E}^\pi \left[R_{t+1}\mid S_t = s \right] &= \sum_a\pi(a\mid s)\,\mathbb{E}\left[R_{t+1}\mid S_t=s,A_t=a\right]\\
&= \sum_a\pi(a\mid s)\sum_r p(r\mid s,a)\, r
\end{aligned}
$$

The step from $\mathbb{E}\left[R_{t+1}\mid S_t=s,A_t=a\right]$ to $\sum_r p(r\mid s,a)\, r$ simply uses the fact that once the state and action are fixed, the distribution of obtainable rewards is also fixed, so the expectation is just the weighted sum over all possible rewards.

In other words, it is the expected immediate reward returned by the environment after taking an action at step $t$.
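As a minimal illustration of this weighted sum, the sketch below uses an invented policy $\pi(a\mid s)$ and reward distribution $p(r\mid s,a)$ at a single state:

```python
# E[R_{t+1} | S_t = s] = sum_a pi(a|s) * sum_r p(r|s,a) * r, with made-up numbers.
pi = {"left": 0.4, "right": 0.6}          # pi(a|s), hypothetical policy at state s
p_r = {                                   # p(r|s,a), hypothetical reward distributions
    "left":  {0.0: 0.5, 1.0: 0.5},
    "right": {0.0: 0.2, 1.0: 0.8},
}

expected_immediate_reward = sum(
    pi[a] * sum(prob * r for r, prob in p_r[a].items())
    for a in pi
)
print(expected_immediate_reward)  # 0.4 * 0.5 + 0.6 * 0.8 = 0.68
```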

Next, consider $\mathbb{E}^\pi \left[G_{t+1}\mid S_t = s \right]$:

$$
\begin{aligned}
\mathbb{E}^\pi \left[G_{t+1}\mid S_t = s \right] &= \sum_{s^\prime}\mathbb{E}\left[G_{t+1}\mid S_t=s,S_{t+1}=s^\prime\right]p(s^\prime\mid s)\\
&= \sum_{s^\prime}\mathbb{E}\left[G_{t+1}\mid S_{t+1}=s^\prime\right]p(s^\prime\mid s)\quad\text{(Markov property)}\\
&= \sum_{s^\prime}v^\pi(s^\prime)\sum_a \pi(a\mid s)\,p(s^\prime\mid s,a)
\end{aligned}
$$

Its physical meaning is the expected value of all future rewards collected from the next step onward, after taking an action at step $t$.

The Bellman Equation is therefore:

$$
\begin{aligned}
v^\pi(s) &= \mathbb{E}^\pi \left[R_{t+1}\mid S_t = s \right]+\gamma\mathbb{E}^\pi \left[G_{t+1}\mid S_t = s \right]\\
&= \sum_a\pi(a\mid s)\sum_r p(r\mid s,a)\, r + \gamma\sum_{s^\prime}v^\pi(s^\prime)\sum_a \pi(a\mid s)\,p(s^\prime\mid s,a)\\
&= \underbrace{\sum_a\pi(a\mid s)\sum_r p(r\mid s,a)\, r}_{\text{mean of immediate rewards}} + \underbrace{\gamma\sum_a\pi(a\mid s)\sum_{s^\prime} p(s^\prime\mid s,a)\,v^\pi(s^\prime)}_{\text{mean of future rewards}}\\
&= \sum_a\pi(a\mid s)\left[\sum_r p(r\mid s,a)\, r + \gamma\sum_{s^\prime} p(s^\prime\mid s,a)\,v^\pi(s^\prime)\right],\quad\forall s\in \mathbb{S}
\end{aligned}
$$

where:

- $\pi(a\mid s)$ is determined by the policy;
- $p(r\mid s,a)$ and $p(s^\prime\mid s,a)$ are determined by the model (in the model-free setting, different algorithms are needed);
- although $v^\pi(s)$ depends on $v^\pi(s^\prime)$, one such equation can be written for every state, and the resulting system of linear equations can then be solved (it is not hard to see that its coefficient matrix has full rank); a small numerical sketch follows this list.
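Below is a minimal sketch of the elementwise Bellman equation in action: a tiny 2-state, 2-action MDP is invented purely for illustration, and $v^\pi$ is obtained by repeatedly applying the right-hand side of the equation (fixed-point iteration) rather than by solving the linear system directly:

```python
import numpy as np

# A made-up 2-state, 2-action MDP, used only to illustrate the elementwise
# Bellman equation; transitions and rewards are deterministic for simplicity.
gamma = 0.9
states, actions = [0, 1], ["stay", "move"]
pi = {0: {"stay": 0.5, "move": 0.5},           # pi(a|s), hypothetical policy
      1: {"stay": 1.0, "move": 0.0}}
next_state = {(0, "stay"): 0, (0, "move"): 1,  # deterministic p(s'|s,a)
              (1, "stay"): 1, (1, "move"): 0}
reward = {(0, "stay"): 0.0, (0, "move"): 1.0,  # deterministic p(r|s,a)
          (1, "stay"): 2.0, (1, "move"): 0.0}

# Fixed-point iteration: apply the right-hand side of the Bellman equation
# until v stops changing.
v = np.zeros(len(states))
for _ in range(10_000):
    v_new = np.array([
        sum(pi[s][a] * (reward[s, a] + gamma * v[next_state[s, a]]) for a in actions)
        for s in states
    ])
    if np.max(np.abs(v_new - v)) < 1e-12:
        break
    v = v_new

print(v)  # each entry satisfies v(s) = sum_a pi(a|s) [ r(s,a) + gamma * v(s') ]
```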

To make it easier to solve, the equation can be written further in matrix form:

$$
\begin{aligned}
v^\pi(s) &= \mathbb{E}^\pi \left[R_{t+1}\mid S_t = s \right]+\gamma\mathbb{E}^\pi \left[G_{t+1}\mid S_t = s \right]\\
&= \sum_a\pi(a\mid s)\sum_r p(r\mid s,a)\, r + \gamma\sum_{s^\prime}v^\pi(s^\prime)\sum_a \pi(a\mid s)\,p(s^\prime\mid s,a)\\
&= r^\pi(s) + \gamma \sum_{s^\prime}v^\pi(s^\prime)\,p(s^\prime\mid s),\quad\forall s\in \mathbb{S}
\end{aligned}
$$

where $r^\pi(s) \triangleq \sum_a\pi(a\mid s)\sum_r p(r\mid s,a)\, r$ is the expected immediate reward at $s$, and $p(s^\prime\mid s) \triangleq \sum_a \pi(a\mid s)\,p(s^\prime\mid s,a)$ is the transition probability from $s$ to $s^\prime$ under $\pi$.

For $s_i\in \mathbb{S}=\{s_1,s_2,\cdots,s_n\}$:

$$
v^\pi(s_i) = r^\pi(s_i) + \gamma \sum_{s_j}v^\pi(s_j)\,p(s_j\mid s_i)
$$

Let:

- $v^\pi=\begin{bmatrix}v^\pi(s_1),v^\pi(s_2),\cdots,v^\pi(s_n)\end{bmatrix}^T\in\mathbb{R}^n$
- $r^\pi=\begin{bmatrix}r^\pi(s_1),r^\pi(s_2),\cdots,r^\pi(s_n)\end{bmatrix}^T\in \mathbb{R}^n$
- $P^\pi\in \mathbb{R}^{n\times n}$, where $P^\pi_{i,j} = p(s_j\mid s_i)$, i.e. the state transition matrix.

Then the Bellman Equation can be written as:

$$
v^\pi = r^\pi + \gamma P^\pi v^\pi
$$
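As a sketch of solving the matrix form directly, the $r^\pi$ and $P^\pi$ below are made-up numbers for a hypothetical 3-state chain (not derived from any particular MDP); the closed-form solution $v^\pi = (I-\gamma P^\pi)^{-1}r^\pi$ is compared against plain fixed-point iteration:

```python
import numpy as np

# Made-up r^pi and P^pi for a hypothetical 3-state chain; each row of P^pi sums to 1.
gamma = 0.9
r_pi = np.array([1.0, 0.0, 2.0])
P_pi = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])

# Closed form: rearranging v = r + gamma * P v gives (I - gamma * P) v = r.
v_closed = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# The same fixed point is reached by iterating v <- r + gamma * P v.
v_iter = np.zeros(3)
for _ in range(1000):
    v_iter = r_pi + gamma * P_pi @ v_iter

print(v_closed)
print(v_iter)  # agrees with v_closed up to numerical precision
```

The closed-form solve is only practical for small state spaces; for large ones, iterative updates like the loop above are the usual choice.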

