Use of Linear Programming to Find the Optimal Policy

The general translation of the problem $x=\max\{a_{1},\ldots,a_{n}\}$ into a linear program is as follows:
Minimize: $x$
Subject To:
$\forall i: x \geq a_{i}$
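As a minimal sketch, this translation can be checked with `scipy.optimize.linprog` (the numbers $a_{i}$ below are arbitrary examples):

```python
from scipy.optimize import linprog

# Translating x = max{a_1, ..., a_n} into the LP
#   minimize x   subject to   x >= a_i for all i.
a = [3.0, 7.0, 5.0]

# linprog expects constraints as A_ub @ x <= b_ub, so x >= a_i
# becomes -x <= -a_i; x must be allowed to take negative values.
res = linprog(c=[1.0],
              A_ub=[[-1.0]] * len(a),
              b_ub=[-ai for ai in a],
              bounds=[(None, None)],
              method="highs")
print(res.x[0])  # -> 7.0, i.e. max{3, 7, 5}
```

The minimal $x$ satisfying all the constraints $x \geq a_{i}$ is exactly $\max_{i} a_{i}$.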
It was shown that the Optimality Equations of the discounted infinite horizon problem are:

\begin{eqnarray*}v(s) = \max_{a \in A_{s}}\{r(s,a) + \lambda\sum_{j\in S}p(j\mid s,a)v(j)\}
\end{eqnarray*}


Using the above translation we get the following linear program:
Minimize: $\sum_{j \in S}\alpha(j)v(j)$
Subject To:
$\forall s \in S, a \in A_{s}: v(s) \geq r(s,a) + \lambda\sum_{j\in S}p(j\mid s,a)v(j)$
Here the $\alpha(j)$ are any set of positive constants summing to one. These constants can be viewed as the probability distribution of the agent's initial state in the MDP.
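As a sketch of this program in practice (all MDP numbers below are invented for illustration), each constraint $v(s) \geq r(s,a) + \lambda\sum_{j}p(j\mid s,a)v(j)$ from the max-translation can be rewritten in the $A_{ub}v \leq b_{ub}$ form that `scipy.optimize.linprog` expects:

```python
import numpy as np
from scipy.optimize import linprog

# A hypothetical 2-state, 2-action discounted MDP (numbers invented):
# P[a, s, j] = p(j | s, a), R[s, a] = r(s, a).
lam = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # transitions under action 0
              [[0.5, 0.5], [0.1, 0.9]]])   # transitions under action 1
R = np.array([[1.0, 2.0],                  # r(0, a=0), r(0, a=1)
              [0.0, 3.0]])                 # r(1, a=0), r(1, a=1)
alpha = np.array([0.5, 0.5])               # initial-state distribution

nS, nA = 2, 2
A_ub, b_ub = [], []
for s in range(nS):
    for a in range(nA):
        # v(s) >= r(s,a) + lam * sum_j p(j|s,a) v(j), written as
        # (lam * P[a,s] - e_s) @ v <= -r(s,a)
        A_ub.append(lam * P[a, s] - np.eye(nS)[s])
        b_ub.append(-R[s, a])

res = linprog(c=alpha, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * nS, method="highs")
v = res.x

# The minimizer satisfies the optimality equations:
# v(s) = max_a { r(s,a) + lam * sum_j p(j|s,a) v(j) }
bellman = (R + lam * np.einsum("asj,j->sa", P, v)).max(axis=1)
print(np.allclose(v, bellman))  # expect True
```

The solver returns the optimal value function: minimizing the $\alpha$-weighted sum pushes each $v(s)$ down to the tightest constraint, which is exactly the optimality equation.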
Accordingly, the dual linear program would be:
Maximize: $\sum_{s \in S}\sum_{a \in A_{s}}r(s,a) \chi(s,a)$
Subject To:
$\forall j \in S: \sum_{a \in A_{j}}\chi(j,a) - \lambda\sum_{s \in S}\sum_{a \in A_{s}}p(j\mid s,a)\chi(s,a) = \alpha(j)$
$\forall s \in S, a \in A_{s}: \chi(s,a) \geq 0$
Intuitively, $\chi(s,a)$ can be thought of as the probability of being in state $s$ and performing action $a$, while taking into account the discount factor $\lambda$. In the following theorem $\chi(s,a)$ is defined more formally.

Theorem 6.8 (no proof)  
1.
For every $d \in \Pi^{MR}$, $s\in S$ and $a \in A_{s}$, let:

\begin{eqnarray*}\chi_{d}(s,a) = \sum_{j\in S}\alpha(j)
\sum_{n=1}^{\infty}\lambda^{n-1} Prob[x_{n}=s\bigwedge y_{n}=a\mid
x_{1}=j]
\end{eqnarray*}


where $x_{n}$ and $y_{n}$ denote the state and the action taken at time $n$ when following policy $d$, and $x_{1}$ is the initial state. Then $\chi_{d}(s,a)$ is a feasible solution of the dual linear maximization problem.
2.
Let $\chi(s,a)$ be a feasible solution of the dual linear problem. For every $s\in S$ and $a \in A_{s}$ define $d_{x}(s)$ such that:

\begin{eqnarray*}Prob[d_{x}(s)=a]=\frac{\chi(s,a)}{\sum_{\overline{a} \in
A_{s}}\chi(s,\overline{a})}
\end{eqnarray*}


and then $\chi_{d_{x}}(s,a)=\chi(s,a).$
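Part 1 of the theorem can be checked numerically. The sketch below (all MDP numbers and the randomized policy are invented for illustration) computes $\chi_{d}(s,a)$ for a fixed randomized policy $d$ via the discounted state-occupation measure, then verifies that it satisfies the dual equality constraints:

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP (numbers invented):
# P[a, s, j] = p(j | s, a); alpha = initial-state distribution.
lam = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
alpha = np.array([0.5, 0.5])
d = np.array([[0.4, 0.6],   # Prob[d(0) = a]
              [0.2, 0.8]])  # Prob[d(1) = a]

# Under d, the state transition matrix is P_d[s,j] = sum_a d(s,a) p(j|s,a).
# The discounted state occupation mu solves mu = alpha + lam * P_d^T mu,
# which is the closed form of the infinite sum in the theorem.
P_d = np.einsum("sa,asj->sj", d, P)
mu = np.linalg.solve(np.eye(2) - lam * P_d.T, alpha)
chi = d * mu[:, None]  # chi_d(s,a) = mu(s) * Prob[d(s) = a]

# Verify feasibility for the dual:
# sum_a chi(j,a) - lam * sum_{s,a} p(j|s,a) chi(s,a) = alpha(j)
lhs = chi.sum(axis=1) - lam * np.einsum("asj,sa->j", P, chi)
print(np.allclose(lhs, alpha))  # expect True
```

Summing the geometric series of discounted visitation probabilities reduces to solving one linear system, which is why $\chi_{d}$ is easy to compute for a fixed policy.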

Theorem 6.8 ensures that every solution of the dual linear program can be translated into a policy whose return equals the objective value of that solution. Thus, the optimal solution of the program can be translated into a policy with the maximal possible return. This policy is therefore an optimal policy.
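The policy-extraction step can be sketched as follows (the 2-state, 2-action MDP below is invented for illustration): solve the dual with `scipy.optimize.linprog` and normalize $\chi(s,\cdot)$ row by row, as in part 2 of the theorem:

```python
import numpy as np
from scipy.optimize import linprog

# A hypothetical 2-state, 2-action MDP (numbers invented):
# P[a, s, j] = p(j | s, a), R[s, a] = r(s, a).
lam = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 2.0],
              [0.0, 3.0]])
alpha = np.array([0.5, 0.5])
nS, nA = 2, 2

# Dual: maximize sum_{s,a} r(s,a) chi(s,a) subject to, for every j,
#   sum_a chi(j,a) - lam * sum_{s,a} p(j|s,a) chi(s,a) = alpha(j),
# with chi >= 0 (linprog's default bounds). Variables are flattened
# as chi[s*nA + a]; linprog minimizes, so we negate the rewards.
A_eq = np.zeros((nS, nS * nA))
for j in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[j, s * nA + a] = (j == s) - lam * P[a, s, j]
res = linprog(c=-R.flatten(), A_eq=A_eq, b_eq=alpha, method="highs")
chi = res.x.reshape(nS, nA)

# Theorem 6.8(2): Prob[d_x(s) = a] = chi(s,a) / sum_abar chi(s,abar).
policy = chi / chi.sum(axis=1, keepdims=True)
print(policy.argmax(axis=1))  # the action chosen in each state
```

At the dual optimum the mass of $\chi(s,\cdot)$ concentrates on the optimal actions, so the extracted policy is deterministic here; for these numbers it picks action 1 in both states.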


Yishay Mansour
1999-12-18