next up previous
Next: About this document ... Up: TD( ) algorithm Previous: Algorithm

   
Equality of schemes using backward and forward updates

Now we will show that "looking forward" and "looking backward" schemes are equal (i.e. will give same results).
Let us say:
$\ \Delta V_{t}^{\gamma}(s_{t}) = \alpha [R_{t}^{\gamma} -
V_{t}(s_{t})] $ for the forward scheme
$\ \Delta V_{t}^{TD}(s) = \alpha \delta_{t} e_{t}(s) $ for the backward scheme
We will show that
$\ \Sigma_{t=0}^{T-1} \Delta V_{t}^{TD}(s) = \Sigma_{t=0}^{T-1}
\Delta V_{t}^{\gamma}(s_{t}) I(s,s_{t}) $
where,
$\
I(s,s_{t}) = 1 $ if $\ s = s_{t} $ ,and $\ I(s,s_{t}) = 0 $ if $\ s\neq s_{t} $
We can rewrite $\ e_{t}(s) $ as $\ e_{t}(s) = \Sigma_{k=0}^{t}
(\gamma \lambda)^{t-k} I(s,s_{k}) $
We have for the left side of the equation:

$\
\Sigma_{t=0}^{T-1} \Delta V_{t}^{TD}(s) = \Sigma_{t=0}^{T-1}
\alpha \delta_{t} \Sigma_{k=0}^{t} (\gamma \lambda)^{t-k}
I(s,s_{k}) $

= $\ \Sigma_{k=0}^{T-1} \alpha
\Sigma_{t=0}^{k} (\gamma \lambda)^{k-t} I(s,s_{t}) \delta_{k} $

$\ = \Sigma_{t=0}^{T} \alpha \Sigma_{k=0}^{T-1} (\gamma
\lambda)^{k-t} I(s,s_{t}) \delta_{k} $

$\ =
\Sigma_{t=0}^{T-1} \alpha I(s,s_{t}) \Sigma_{k=t}^{T-1} (\gamma
\lambda)^{k-t} \delta_{k} $

And for the right side:

$\ \frac{1}{\alpha}
\Delta_{t}^{\gamma} (s) = R_{t}^{\gamma} - V_{t}(S) $
$\ =
-V_{t}(s_{t}) + (1-\gamma)\gamma^{0}[r_{t} + \gamma V_{t}(s_{t})]
+ (1-\gamma)\gamma[r_{t} + \gamma r_{t+1} + \gamma^{2}
V_{t}(s_{t})] + $ ...

$\ = -V_{t}(s_{t}) +
\Sigma_{i=0}^{\infty} (\lambda\gamma)^{i} [r_{t+i} + \lambda
(1-\gamma) V_{t}(s_{t+i+1})] $

$\ = -V_{t}(s_{t}) +
\Sigma_{i=0}^{\infty} (\lambda\gamma)^{i} [r_{t+i} + \lambda
V_{t}(s_{t+i+1}) - V_{t}(s_{t+i})] $

$\ =
\Sigma_{k=t}^{\infty} (\lambda\gamma)^{k-t} \delta_{k} =
\Sigma_{k=t}^{T-1} (\lambda\gamma)^{k-t} \delta_{k} $

So, we have :
$\ \Sigma_{t=0}^{T} \Delta V_{t}^{\gamma} (s_{t})
I(s,s_{t}) = \Sigma_{t=0}^{T-1} \alpha I(s,s_{t})
\Sigma_{k=t}^{T-1} (\lambda\gamma)^{k-t} \delta_{k} $
We have for right side the same expression we have already for the left side of the equation. This implies that both methods give same updates at the end of the run.


Yishay Mansour
2000-01-06