next up previous
Next: About this document ... Up: Finite Horizon Previous: Computational Complexity

   
The Optimality Principal

Let us define: \(U^*_t(h_t) = \max_{\pi\in{\Pi^{HR}}} \{U^{\pi}_t(h_t)\}\). For the simplicity of the discussion we assume that both S and A are finite. The optimality equations (also known as Bellman equations) are:

$U_t(h_t) = \max_{a\in A} \{r_t(s_t,a) + \sum_{j\in S} P_t(j\vert s_t,a)U_{t+1}(h_t,a,j)\},$

Where UN(hN)=rN(sN) for hN=(hN-1,aN-1,sN).

Note that replacing the max operation in the equation with the action taken by policy $\pi$, we get the equation for computing the return of policy $\pi$.

Theorem 3.2   Let Ut be the solution for the optimality equations for t = 1 up to N-1, then:

(1) $\forall t \leq N-1, \forall h_t\in H_t, \; u_t(h_t)=u^*_t(h_t)$

(2) $\forall s_1 \in S, \; u_1(s_1)=V^*_N(s_1)$

Proof:First, we will show by induction that $u_t(h_t) \geq u^*_t(h_t)$ for $1\leq t\leq N$ and for every $h_t\in H_t$. Then we exhibit a specific policy $\pi$ for which $V^{\pi}_t(h_t)=u_t(h_t)$ and that this will conclude the proof.

Basis: When t=N, by definition $u_N(h_N)=r_N(s_N)=u^{\pi}_N(h_N)$ for every policy $\pi$ and therefore uN(hN)=u*N(hN).

Induction Step: Let us assume that $u_t(h_t) \geq u^*_t(h_t)$ for $n+1 \leq t \leq N$ and $h_t\in H_t$. Let $\pi'$ be a policy in $\Pi^{HR}$ (that performs the operation dn' on step n). for t=n:

$u_n(h_n) = \max_{a \in A} \{ r(s_n,a) + \sum_{j \in S} P_n(j\vert s_n,a)u_{n+1}(h_n,a,j)\}$

Consider an arbitrary policy $\pi'$, that for history hn uses stochastic action q(a). By the induction assumption,

\begin{eqnarray*}u_n(h_n) & \geq & \max_{ a \in A} \{ r(s_n,a) + \sum_{j \in S} ...
...vert s_n,a) u^{\pi'}_{n+1}(h_n,a,j)\} \\
& = & u^{\pi'}_n(h_n)
\end{eqnarray*}


Therefore, $u_n(h_n) \geq U^{\pi'}_n(h_n)$ for every $\pi'$ and in particular to the optimal policy, $u_n(h_n)\geq u^*_n(h_n)= \max_{\pi'} u^{\pi'}_n(h_n)$. It is left to show that there exists a policy $\pi'$ for which $V^{\pi'}_N(h_n)=u_n(h_n)$. Let us define $\pi'$ in the following manner:

$\pi'(h_n) = argmax_{a \in A} \{ r(s_n,a) + \sum_{j \in S} P_n(j\vert s_n,a)u_{n+1}(h_n,a,j) \}$

We will now show that $u_n(h_n)=u^{\pi'}_n(h_n)$ using reversed induction on n.

Basis: for n=N this is obviously true.

Induction Step:

\begin{eqnarray*}u^{\pi'}_n(h_n) & = & r(s_n,\pi'(h_n)) + \sum_{j \in S} P_n(j\v...
...\in S} P_n(j\vert s_n, \pi'(h_n))u_{n+1}(h_n) \\
& = & u_n(h_n)
\end{eqnarray*}


The second equality is by the induction hypothesis, that $u^{\pi'}_t=u_t(h_t)$ for $t \geq n+1$. Therefore $u^{\pi'}_n(h_n)=u_n(h_n)$ and since $u^*_n(h_n) \geq u^{\pi'}_n(h_n)$ it is clear that un(hn)=u*n(hn) as was required.$\Box$


next up previous
Next: About this document ... Up: Finite Horizon Previous: Computational Complexity
Yishay Mansour
1999-11-15