Consider the following trajectory:
$$
S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \cdots
$$
The return is:
$$
\begin{aligned}
G_t &= R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots\\
&= R_{t+1}+\gamma(R_{t+2}+\gamma R_{t+3}+\cdots)\\
&= R_{t+1}+\gamma G_{t+1}
\end{aligned}
$$
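As a quick sanity check of the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, here is a minimal Python sketch; the reward sequence and $\gamma$ below are made-up values for illustration only:

```python
# Minimal sketch: G_t = R_{t+1} + gamma * G_{t+1} on a made-up finite reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, -1.0, 2.0]  # R_{t+1}, R_{t+2}, ... (hypothetical values)

# Compute returns backwards: G_T = 0, then G_t = R_{t+1} + gamma * G_{t+1}.
returns = [0.0] * (len(rewards) + 1)
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]

# Direct definition: G_0 = sum_k gamma^k * R_{k+1}; both ways agree.
g0_direct = sum(gamma**k * r for k, r in enumerate(rewards))
assert abs(returns[0] - g0_direct) < 1e-12
print(returns[0])
```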
Then the state value can be written as:
$$
\begin{aligned}
V^\pi(s) &= \mathbb{E}^\pi \left[G_t \mid S_t = s \right]\\
&= \mathbb{E}^\pi \left[R_{t+1}+\gamma G_{t+1} \mid S_t = s \right]\\
&= \mathbb{E}^\pi \left[R_{t+1} \mid S_t = s \right]+\gamma\,\mathbb{E}^\pi \left[G_{t+1} \mid S_t = s \right]
\end{aligned}
$$
First, consider $\mathbb{E}^\pi \left[R_{t+1}\mid S_t = s \right]$:
$$
\begin{aligned}
\mathbb{E}^\pi \left[R_{t+1}\mid S_t = s \right] &= \sum_a \pi(a\mid s)\,\mathbb{E}\left[R_{t+1}\mid S_t=s, A_t=a\right]\\
&= \sum_a \pi(a\mid s)\sum_r p(r\mid s,a)\cdot r
\end{aligned}
$$
Regarding the step from $\mathbb{E}\left[R_{t+1}\mid S_t=s,A_t=a\right]$ to $\sum_r p(r\mid s,a)\cdot r$: once the state and action are fixed, the distribution over attainable rewards is also fixed, so the expectation is simply the probability-weighted sum over all possible rewards.
In other words, this term is the expected immediate reward returned by the environment after taking an action at step $t$.
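As a concrete (hypothetical) example of this weighted sum, the sketch below picks arbitrary values for $\pi(a\mid s)$ and $p(r\mid s,a)$ at a single state:

```python
# Minimal sketch: E^pi[R_{t+1} | S_t = s] = sum_a pi(a|s) * sum_r p(r|s,a) * r.
# All probabilities and rewards below are made up for illustration.
pi_s = {"left": 0.4, "right": 0.6}          # pi(a|s)
p_r = {                                     # p(r|s,a): reward -> probability
    "left":  {0.0: 0.8, 1.0: 0.2},
    "right": {0.0: 0.5, 1.0: 0.5},
}

expected_immediate_reward = sum(
    pi_s[a] * sum(p * r for r, p in p_r[a].items())
    for a in pi_s
)
print(expected_immediate_reward)  # 0.4*0.2 + 0.6*0.5 = 0.38
```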
Now consider $\mathbb{E}^\pi \left[G_{t+1}\mid S_t = s \right]$:
$$
\begin{aligned}
\mathbb{E}^\pi \left[G_{t+1}\mid S_t = s \right] &= \sum_{s^\prime}\mathbb{E}\left[G_{t+1}\mid S_t=s, S_{t+1}=s^\prime\right] p(s^\prime\mid s)\\
&= \sum_{s^\prime}\mathbb{E}\left[G_{t+1}\mid S_{t+1}=s^\prime\right] p(s^\prime\mid s)\quad\text{(Markov property)}\\
&= \sum_{s^\prime} v^\pi(s^\prime) \sum_a \pi(a\mid s)\, p(s^\prime\mid s,a)
\end{aligned}
$$
Its meaning is the expected value of all future rewards collected from the next step onward, after taking an action at step $t$.
The Bellman equation is then:
$$
\begin{aligned}
v^\pi(s) &= \mathbb{E}^\pi \left[R_{t+1}\mid S_t = s \right] + \gamma\,\mathbb{E}^\pi \left[G_{t+1}\mid S_t = s \right]\\
&= \sum_a\pi(a\mid s)\sum_r p(r\mid s,a)\cdot r + \gamma\sum_{s^\prime} v^\pi(s^\prime)\sum_a \pi(a\mid s)\, p(s^\prime\mid s,a)\\
&= \underbrace{\sum_a\pi(a\mid s)\sum_r p(r\mid s,a)\cdot r}_{\text{mean of immediate rewards}} + \underbrace{\gamma\sum_a\pi(a\mid s)\sum_{s^\prime} p(s^\prime\mid s,a)\,v^\pi(s^\prime)}_{\text{mean of future rewards}}\\
&= \sum_a\pi(a\mid s)\left[\sum_r p(r\mid s,a)\cdot r + \gamma\sum_{s^\prime} p(s^\prime\mid s,a)\,v^\pi(s^\prime)\right],\quad\forall s\in \mathbb{S}
\end{aligned}
$$
where:
$\pi(a\mid s)$ is determined by the policy;
$p(r\mid s,a)$ and $p(s^\prime\mid s,a)$ are determined by the model (in the model-free setting, other algorithms are needed);
Although $v^\pi(s)$ depends on $v^\pi(s^\prime)$, we can in fact write one such equation for every state and solve the resulting system of linear equations (it is not hard to see that the coefficient matrix is full rank); a small numerical sketch follows this list.
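Before moving to the matrix form, here is a minimal sketch of evaluating the elementwise Bellman equation by fixed-point iteration (iterative policy evaluation) on a made-up 2-state, 2-action MDP; all transition probabilities, rewards, and the policy are hypothetical:

```python
import numpy as np

# Minimal sketch: iterative policy evaluation on a made-up 2-state, 2-action MDP.
# P[a, s, s'] = p(s'|s,a); R[a, s] = sum_r p(r|s,a) * r; pi[s, a] = pi(a|s).
# All numbers are hypothetical.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])

v = np.zeros(2)
for _ in range(10_000):
    # v_new(s) = sum_a pi(a|s) * [R(s,a) + gamma * sum_s' p(s'|s,a) v(s')]
    v_new = np.array([
        sum(pi[s, a] * (R[a, s] + gamma * P[a, s] @ v) for a in range(2))
        for s in range(2)
    ])
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new
print(v)  # approximate v^pi(s_1), v^pi(s_2)
```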
To make solving it easier, the equation can be further written in matrix form:
$$
\begin{aligned}
v^\pi(s) &= \mathbb{E}^\pi \left[R_{t+1}\mid S_t = s \right] + \gamma\,\mathbb{E}^\pi \left[G_{t+1}\mid S_t = s \right]\\
&= \sum_a\pi(a\mid s)\sum_r p(r\mid s,a)\cdot r + \gamma\sum_{s^\prime} v^\pi(s^\prime)\sum_a \pi(a\mid s)\, p(s^\prime\mid s,a)\\
&= r^\pi(s) + \gamma \sum_{s^\prime} v^\pi(s^\prime)\, p(s^\prime\mid s),\quad\forall s\in \mathbb{S}
\end{aligned}
$$
For $s_i \in \mathbb{S} = \{s_1, s_2, \cdots, s_n\}$:
$$
v^\pi(s_i) = r^\pi(s_i) + \gamma \sum_{s_j} v^\pi(s_j)\, p(s_j\mid s_i)
$$
Let:
$v^\pi=\begin{bmatrix}v^\pi(s_1),v^\pi(s_2),\cdots,v^\pi(s_n)\end{bmatrix}^T\in\mathbb{R}^n$;
$r^\pi=\begin{bmatrix}r^\pi(s_1),r^\pi(s_2),\cdots,r^\pi(s_n)\end{bmatrix}^T\in \mathbb{R}^n$;
$P^\pi\in \mathbb{R}^{n\times n}$, where $P^\pi_{i,j} = p(s_j\mid s_i)$, i.e., the state transition matrix;
Then the Bellman equation can be written as:
$$
v^\pi = r^\pi + \gamma P^\pi v^\pi
$$
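Since $\gamma < 1$ and $P^\pi$ is a stochastic matrix, $I - \gamma P^\pi$ is invertible and the system has a unique solution. Below is a minimal closed-form solve, reusing the same made-up MDP as in the iterative sketch above ($r^\pi$ and $P^\pi$ are built from the hypothetical $\pi$, rewards, and transitions):

```python
import numpy as np

# Minimal sketch: solve v^pi = r^pi + gamma * P^pi v^pi in closed form
# for the same made-up 2-state MDP as in the iterative sketch above.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])   # P[a, s, s'] = p(s'|s,a), hypothetical
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # R[a, s] = sum_r p(r|s,a) * r, hypothetical
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])                # pi[s, a] = pi(a|s), hypothetical

# r^pi(s) = sum_a pi(a|s) * R(s,a);  P^pi[s, s'] = sum_a pi(a|s) * p(s'|s,a)
r_pi = np.einsum("sa,as->s", pi, R)
P_pi = np.einsum("sa,ast->st", pi, P)

# Solve (I - gamma * P^pi) v = r^pi rather than forming the inverse explicitly.
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v)  # matches the fixed-point iteration result
```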