next up previous
Next: Correctness of Value Iteration Up: Finding the Optimal Policy: Previous: Finding the Optimal Policy:

   
The Value Iteration Algorithm

Input: MDP and parameters $\epsilon$ and $\lambda$
1.
Choose an initial return value function $\vec{v}_{0}$ (by choosing a number for each $s\in S$).
2.
$n\leftarrow 0$.
3.
  Assign the next return value function:

\begin{eqnarray*}\forall s\in S,\ v_{n+1}(s) \leftarrow \max_{a \in A_{s}}\{r(s,a)
+ \lambda\sum_{j\in S}p(j\mid s,a)v_{n}(j)\}
\end{eqnarray*}


4.
$n\leftarrow n+1$
5.
If $\Vert\vec{v}_{n+1}-\vec{v}_{n}\Vert < \epsilon \cdot \frac{1-
\lambda}{2\lambda}$, stop,
Else return to ([*]).
6.
Choose the output policy such that:

\begin{eqnarray*}\pi_{\epsilon}(s) \in argmax_{a \in A_{s}}\{r(s,a) + \lambda\sum_{j\in S}p(j\mid s,a)v_{n+1}(j)\}
\end{eqnarray*}




Yishay Mansour
1999-12-18