next up previous
Next: TD-Gammon Up: TD-Gammon Previous: methods of encoding

   
Choosing the parameters for F(r,s)

If it is possible, we would like to find a vector r such that:

\begin{displaymath}\forall s:\ F(r,s) = V^{\pi}(s)\end{displaymath}

In most cases, it isn't possible since there aren't sufficient computing resources. Therefore, we should discuss the distance between F and $V^{\pi}$. One way to measure this distance is to calculate the minimum square error (MSE):

\begin{displaymath}^{\min}_{\ r}E_{s}[(F(r,s)-V^{\pi}(s))^{2}]\end{displaymath}

If it is possible to encode $V^{\pi}$ using F(r,s), this minimum equals to 0. Otherwise, we try to get as close as possible to $V^{\pi}(s)$. For linear functions it is sometimes possible to compute $\overrightarrow{r}$ which minimizes the MSE, but generally, it is a difficult task. Let's look at the problem in the following manner: we try to minimize

\begin{displaymath}G(r,s)=E_{s}[(F(r,s)-V^{\pi}(s))^{2}]\ .\end{displaymath}

Finding the global minimum cannot always be done efficiently, and it is much easier to find a local minimum (and hope it is good enough).To find a local minimum, we can follow the derivative of G(r,s). As long as the derivative is non zero, we have a direction, in which G(r,s) is descending. When the value of the derivative equals to zero, we are done. We now only need to establish that this is a minimum (and not a maximum or an inflection point). The reason that this is not a maximum value is that the function descends to this point. However, we could be at an inflection point, and remain there (because the derivative could be fixed on zero in that neighborhood).

\begin{displaymath}\overrightarrow{r_{t+1}} =
\overrightarrow{r_{t}}-\frac{\alp...
...)-V_{t}(s_{t})]\cdot\nabla_{\overrightarrow{r_{t}}}V_{t}(s_{t})\end{displaymath}

Recall that:

\begin{displaymath}\nabla_{\overrightarrow{r_{t}}}V_{t}(s_{t}) =
(\frac{\partia...
...(s_{t}),\ldots,\frac{\partial}{\partial
r_{k}}V_{t}(s_{t}))\ .\end{displaymath}

Of course, we don't have access to $V^{\pi}(s_{t})$, but rather to a sample of it vt (which could be noisy). Thus, we have:

\begin{displaymath}\overrightarrow{r_{t+1}} = \overrightarrow{r_{t}} +
\alpha[v_{t}-V_{t}(s_{t})]\cdot\nabla_{\overrightarrow{r_{t}}}V_{t}(s_{t})\end{displaymath}

For example, vt could be a Monte-Carlo estimate, i.e. the total reward from time t onwards Rt. In the case of backgammon, vt denotes whether we won or lost the game. Similarly,

\begin{displaymath}\overrightarrow{r_{t+1}} =
\overrightarrow{r_{t}} +
\alpha[...
...}-V_{t}(s_{t})]\cdot\nabla_{\overrightarrow{r_{t}}}V_{t}(s_{t})\end{displaymath}


next up previous
Next: TD-Gammon Up: TD-Gammon Previous: methods of encoding
Yishay Mansour
2000-01-17