
   
Convergence of Value Iteration Algorithm

In this section we show that the convergence of the VI algorithm to the optimal policy is monotonic, and that the error decreases exponentially fast, at a rate governed by the discount factor $\lambda$.

Claim 6.2   Monotonicity of convergence:
1. If $v \geq u$ then $Lv \geq Lu$.
2. If $Lv_{n} \geq v_{n}$ then $\forall m \geq 0,\ v_{n+m+1} \geq v_{n+m}$.

Proof: For part (1), let

\begin{eqnarray*}\delta \in argmax_{\pi \in \Pi^{MD}}\{r_{\pi} + \lambda P_{\pi}u\}
\end{eqnarray*}


Since $P_{\delta} \geq 0$ and $v \geq u$, it follows that $P_{\delta}v \geq P_{\delta}u$. Therefore:

\begin{eqnarray*}Lu = r_{\delta} + \lambda P_{\delta}u \leq r_{\delta} + \lambda P_{\delta}v \leq \max_{d \in
\Pi^{MD}}\{r_{d} + \lambda P_{d}v\} = Lv
\end{eqnarray*}


This proves part (1).
From part (1) it follows by induction that if $v \geq u$, then $\forall m \geq 1,\ L^{m}v \geq L^{m}u$.
To prove part (2), apply $L^{m}$ to both sides of $Lv_{n} \geq v_{n}$ and use the monotonicity just established:

\begin{eqnarray*}v_{n+m+1} = L^{m}Lv_{n} \geq L^{m}v_{n} = v_{n+m}
\end{eqnarray*}


$\Box$

From the above claim it follows that if $Lv_{0} \geq v_{0}$ (or $Lv_{0} \leq v_{0}$), VI converges monotonically, i.e., the value estimates improve (weakly) with every additional iteration. If all rewards in a specific MDP are non-negative (or non-positive), we can choose $v_{0} = \vec{0}$: then $Lv_{0} = \max_{d}r_{d} \geq \vec{0} = v_{0}$ (respectively $\leq$), so the condition is met.
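As a concrete illustration (not part of the original notes), the following sketch runs value iteration on a small, hypothetical two-state MDP with non-negative rewards, starting from $v_{0} = \vec{0}$, and checks that the iterates are monotonically non-decreasing, as Claim 6.2 predicts. The rewards, transition probabilities and the choice $\lambda = 0.9$ are arbitrary values made up for the example.

\begin{verbatim}
# Minimal sketch: value iteration on a hypothetical two-state MDP
# with non-negative rewards, started from v0 = 0 (so L v0 >= v0).
import numpy as np

lam = 0.9                                  # discount factor lambda
r = np.array([[1.0, 0.5],                  # r[s, a] >= 0
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2], [0.5, 0.5]],    # P[s=0, a, s']
              [[0.1, 0.9], [0.3, 0.7]]])   # P[s=1, a, s']

def L(v):
    # One Bellman backup: (Lv)(s) = max_a { r(s,a) + lam * sum_s' P(s'|s,a) v(s') }
    return (r + lam * (P @ v)).max(axis=1)

v = np.zeros(2)                            # v0 = 0
for n in range(50):
    v_next = L(v)
    assert np.all(v_next >= v - 1e-12)     # monotone improvement (Claim 6.2)
    v = v_next
print("value estimate after 50 iterations:", v)
\end{verbatim}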

Theorem 6.3   Let $v_{n}$ be the sequence of values computed by the VI algorithm.
1. $\forall n,\ \Vert v_{n} - v_{\lambda}^{*}\Vert \leq \frac{\lambda^{n}}{1-\lambda}\Vert v_{1} - v_{0}\Vert$
2. $\Vert v_{\lambda}^{\pi_{n}} - v_{\lambda}^{*}\Vert \leq \frac{2\lambda^{n}}{1-\lambda}\Vert v_{1} - v_{0}\Vert$, where $\pi_{n}$ is the policy defined by $v_{n}$.

Proof: Part (1): We will use the result of part ([*]) in Theorem [*]:

\begin{eqnarray*}\mbox{If}\ \ \Vert v_{n}-v_{n-1}\Vert < \epsilon \cdot \frac{1-\lambda}{\lambda}\ \ \mbox{then}\ \ \Vert v_{n} -
v_{\lambda}^{*}\Vert < \epsilon
\end{eqnarray*}


To use it we bound:

\begin{eqnarray*}\Vert v_{n}-v_{n-1}\Vert & = & \Vert L^{n-1}v_{1}-L^{n-1}v_{0}\Vert \\
& \leq & \lambda^{n-1}\Vert v_{1} - v_{0}\Vert\ \ (\mbox{because } L \mbox{ is contracting})
\end{eqnarray*}


Let $\epsilon = \frac{\lambda^{n}}{1-\lambda}\Vert v_{1} - v_{0}\Vert$; then $\epsilon \cdot \frac{1-\lambda}{\lambda} = \lambda^{n-1}\Vert v_{1} - v_{0}\Vert$, so Theorem [*] allows us to conclude that:

\begin{eqnarray*}\Vert v_{n} - v_{\lambda}^{*}\Vert \leq \frac{\lambda^{n}}{1-\lambda}\Vert v_{1} - v_{0}\Vert
\end{eqnarray*}


This proves part (1).
To prove part (2), we first apply the triangle inequality:

\begin{eqnarray*}\Vert v_{\lambda}^{\pi_{n}} - v_{\lambda}^{*}\Vert & \leq & \Vert v_{\lambda}^{\pi_{n}} - v_{n}\Vert + \Vert v_{n}
- v_{\lambda}^{*}\Vert
\end{eqnarray*}


The first term, (a), is bounded using the same result of Theorem [*], and the second term, (b), using part (1) of this theorem:

\begin{eqnarray*}\Vert v_{\lambda}^{\pi_{n}} - v_{\lambda}^{*}\Vert & \leq & \overbrace{\Vert v_{\lambda}^{\pi_{n}} - v_{n}\Vert}^{(a)} + \overbrace{\Vert v_{n} - v_{\lambda}^{*}\Vert}^{(b)} \\
& \leq & \frac{\lambda}{1-\lambda}\Vert v_{n} - v_{n-1}\Vert + \frac{\lambda^{n}}{1-\lambda}\Vert v_{1} - v_{0}\Vert \\
& \leq & \frac{2\lambda^{n}}{1-\lambda}\Vert v_{1} - v_{0}\Vert
\end{eqnarray*}


The last inequality follows from $L$ being a contracting operator, since then $\Vert v_{n} - v_{n-1}\Vert \leq \lambda^{n-1}\Vert v_{1} - v_{0}\Vert$; this completes the proof of part (2). $\Box$

From this theorem it follows that each iteration of the VI algorithm brings the value estimate closer to $v_{\lambda}^{*}$ by a factor of $\lambda$ (in the sense of the error bound above). As $\lambda$ approaches 1, the rate of convergence slows down. The bounds we have shown can be used to determine, in advance, the number of iterations needed to reach a given accuracy on a specific problem.
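For example, to guarantee $\Vert v_{n} - v_{\lambda}^{*}\Vert \leq \epsilon$ for a desired accuracy $\epsilon > 0$, by part (1) it suffices that $\frac{\lambda^{n}}{1-\lambda}\Vert v_{1} - v_{0}\Vert \leq \epsilon$. Taking logarithms (and recalling that $\ln\lambda < 0$, which reverses the inequality) gives a sufficient number of iterations:

\begin{eqnarray*}n \geq \frac{\ln\left(\frac{\epsilon(1-\lambda)}{\Vert v_{1} - v_{0}\Vert}\right)}{\ln\lambda}
\end{eqnarray*}

(assuming $\epsilon(1-\lambda) < \Vert v_{1} - v_{0}\Vert$; otherwise the bound of part (1) already holds for every $n$). As $\lambda \rightarrow 1$ the denominator $\ln\lambda$ tends to 0, so the required number of iterations grows, in agreement with the observation above.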


  
