TD($\gamma$) as a generalization of MC and TD(0) methods

MC uses the overall return of the run to perform the update, i.e., $R_{t}=\sum_{i=0}^{T}\lambda^{i}r_{t+i}$.
TD(0) uses only the immediate reward for its update, and its target has the form $R_{t}^{(1)}=r_{t} + \lambda V_{t}(s_{t+1})$.
We can consider other options between those two extremes. For example, we can use the first two rewards, i.e., $R_{t}^{(2)}=r_{t} + \lambda r_{t+1} + \lambda^{2} V_{t}(s_{t+2})$.
In general, if $n$ reward values are used, we have $R_{t}^{(n)}=r_{t} + \lambda r_{t+1} + \cdots + \lambda^{n-1} r_{t+n-1} + \lambda^{n} V_{t}(s_{t+n})$.
Note: if $n$ is larger than the number of steps to the end of the run, we set all the rewards after the end of the run to 0.
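As a concrete illustration, here is a minimal Python sketch of the $n$-step target $R_{t}^{(n)}$; the function name n_step_return, the list-based episode representation and the dictionary V are illustrative assumptions, not part of the notes.

    def n_step_return(rewards, V, states, t, n, lam):
        """Compute R_t^(n) = r_t + lam*r_{t+1} + ... + lam^(n-1)*r_{t+n-1}
        + lam^n * V(s_{t+n}); rewards past the end of the run count as 0."""
        T = len(rewards)                      # length of the run
        R = 0.0
        for i in range(n):
            if t + i < T:                     # reward exists inside the run
                R += (lam ** i) * rewards[t + i]
        if t + n < len(states):               # bootstrap only if s_{t+n} exists
            R += (lam ** n) * V[states[t + n]]
        return R

With $n=1$ this reproduces the TD(0) target, and for $n$ at least the number of remaining steps it reproduces the MC return (assuming the terminal value is 0).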
The update is defined as $\triangle V_{t}(s_{t}) = \alpha[R_{t}^{(n)}-V_{t}(s_{t})]$.
The update can be performed in one of two ways (a sketch of the online variant follows the list):
1. online: $V_{t+1}(s_{t}) \leftarrow V_{t}(s_{t}) + \triangle V_{t}(s_{t})$
2. off-line: $V(s) \leftarrow V(s) + \sum_{t}\triangle V_{t}(s)$, i.e., the corrections are accumulated and applied only at the end of the run.
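A minimal sketch of the online variant for a single recorded run, reusing the hypothetical n_step_return above; the name online_n_step_td, the step size alpha and the in-place dictionary update are again illustrative assumptions.

    def online_n_step_td(states, rewards, V, n, lam, alpha):
        """Apply V(s_t) <- V(s_t) + alpha * [R_t^(n) - V(s_t)] at every step.
        Updates take effect immediately, so later targets already see them
        (the off-line variant would instead accumulate all corrections and
        add them to V only once the run is over)."""
        for t in range(len(rewards)):
            R_n = n_step_return(rewards, V, states, t, n, lam)
            V[states[t]] += alpha * (R_n - V[states[t]])
        return V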
It is easy to see that the operator $R_{t}^{(n)}$ is $\lambda^{n}$-contracting:
$\Vert R_{t}^{(n)}(V^{(1)}) - R_{t}^{(n)}(V^{(2)})\Vert \leq \lambda^{n}\Vert V^{(1)} - V^{(2)}\Vert$.
So we expect the error to shrink as the iterations proceed.
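To see why, note that $R_{t}^{(n)}$ depends on the value function only through its final bootstrap term; a one-line derivation (written here for the max norm, which is an assumption of this sketch) is:

\begin{align*}
R_{t}^{(n)}(V^{(1)}) - R_{t}^{(n)}(V^{(2)}) &= \lambda^{n}\bigl(V^{(1)}(s_{t+n}) - V^{(2)}(s_{t+n})\bigr),\\
\bigl| R_{t}^{(n)}(V^{(1)}) - R_{t}^{(n)}(V^{(2)}) \bigr| &\leq \lambda^{n}\,\Vert V^{(1)} - V^{(2)}\Vert_{\infty}.
\end{align*}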
The central remaining question is the choice of $n$. One way to handle it is to use a weighted average of all possible values.
The TD($\gamma$) method places a geometric distribution with parameter $\gamma$ over the $R_{t}^{(n)}$, i.e., the weight of $R_{t}^{(n)}$ is $(1-\gamma)\gamma^{n-1}$.
We define $R_{t}^{\gamma}$ to be $R_{t}^{\gamma} = (1-\gamma)\sum_{n=1}^{\infty}\gamma^{n-1}R_{t}^{(n)}$.

For a finite run this expression can be rewritten as
$R_{t}^{\gamma} = (1-\gamma)\sum_{n=1}^{T-1}\gamma^{n-1}R_{t}^{(n)} + \gamma^{T-1}R_{t}$,
where $T$ is the number of steps remaining until the end of the run: for every $n \geq T$ we have $R_{t}^{(n)} = R_{t}$, so the remaining geometric weight $\sum_{n \geq T}(1-\gamma)\gamma^{n-1} = \gamma^{T-1}$ multiplies the full return.
1. If $\gamma=0$ then the weight of $R_{t}^{(1)}$ is 1 and the weight of every other $R_{t}^{(n)}$ is 0. This reduces to TD(0), which we introduced before.
2. For $\gamma = 1$ we get $R_{t}^{\gamma} = R_{t}$ (the total return), i.e., the MC update.
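The sketch below (a toy check, not part of the notes; the episode data and the name gamma_return are illustrative assumptions, and it reuses the hypothetical n_step_return from above) computes $R_{t}^{\gamma}$ on a short finite run and confirms both limiting cases numerically.

    def gamma_return(rewards, V, states, t, lam, gamma):
        """R_t^gamma = (1-gamma) * sum_{n=1}^{T-1} gamma^(n-1) * R_t^(n)
        + gamma^(T-1) * R_t, where T is the number of remaining steps."""
        T = len(rewards) - t                    # steps remaining from t
        full_return = sum(lam ** i * rewards[t + i] for i in range(T))   # R_t
        weighted = sum((1 - gamma) * gamma ** (n - 1) *
                       n_step_return(rewards, V, states, t, n, lam)
                       for n in range(1, T))
        return weighted + gamma ** (T - 1) * full_return

    # toy run: three rewards, a terminal state with value 0, lam = 0.9
    states, rewards = [0, 1, 2, 3], [1.0, 0.0, 2.0]
    V = {0: 0.5, 1: 0.2, 2: 0.1, 3: 0.0}
    print(gamma_return(rewards, V, states, 0, 0.9, 0.0))  # = R_0^(1), the TD(0) target
    print(gamma_return(rewards, V, states, 0, 0.9, 1.0))  # = R_0, the MC return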

Yishay Mansour
2000-01-06