Use of Linear Programming to Find the Optimal Policy

The general translation of the problem $x=\max\{a_{1},\ldots,a_{n}\}$ into a linear program is as follows:
Minimize: $x$
Subject To:
$\forall i: x \geq a_{i}$
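As a minimal sketch, this translation can be checked with `scipy.optimize.linprog` (the numbers $a_{i}$ below are arbitrary examples):

```python
from scipy.optimize import linprog

# Translating x = max{a_1, ..., a_n} into the LP
#   minimize x   subject to   x >= a_i for all i.
a = [3.0, 7.0, 5.0]

# linprog expects constraints as A_ub @ x <= b_ub, so x >= a_i
# becomes -x <= -a_i; x must be allowed to take negative values.
res = linprog(c=[1.0],
              A_ub=[[-1.0]] * len(a),
              b_ub=[-ai for ai in a],
              bounds=[(None, None)],
              method="highs")
print(res.x[0])  # -> 7.0, i.e. max{3, 7, 5}
```

The minimal $x$ satisfying all the constraints $x \geq a_{i}$ is exactly $\max_{i} a_{i}$.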
It was shown that the Optimality Equations of the discounted infinite horizon problem are:

\begin{eqnarray*}v(s) = \max_{a \in A_{s}}\{r(s,a) + \lambda\sum_{j\in S}p(j\mid s,a)v(j)\}
\end{eqnarray*}


Using the above translation we get the following linear program:
Minimize: $\sum_{j \in S}\alpha(j)v(j)$
Subject To:
$\forall s \in S, a \in A_{s}: v(s) \geq r(s,a) + \lambda\sum_{j\in S}p(j\mid s,a)v(j)$
Here the $\alpha(j)$ are any set of positive constants summing to one. These constants can be viewed as the probability distribution of the agent's initial state in the MDP.
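As a sketch of this program in practice (all MDP numbers below are invented for illustration), each constraint $v(s) \geq r(s,a) + \lambda\sum_{j}p(j\mid s,a)v(j)$ from the max-translation can be rewritten in the $A_{ub}v \leq b_{ub}$ form that `scipy.optimize.linprog` expects:

```python
import numpy as np
from scipy.optimize import linprog

# A hypothetical 2-state, 2-action discounted MDP (numbers invented):
# P[a, s, j] = p(j | s, a), R[s, a] = r(s, a).
lam = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # transitions under action 0
              [[0.5, 0.5], [0.1, 0.9]]])   # transitions under action 1
R = np.array([[1.0, 2.0],                  # r(0, a=0), r(0, a=1)
              [0.0, 3.0]])                 # r(1, a=0), r(1, a=1)
alpha = np.array([0.5, 0.5])               # initial-state distribution

nS, nA = 2, 2
A_ub, b_ub = [], []
for s in range(nS):
    for a in range(nA):
        # v(s) >= r(s,a) + lam * sum_j p(j|s,a) v(j), written as
        # (lam * P[a,s] - e_s) @ v <= -r(s,a)
        A_ub.append(lam * P[a, s] - np.eye(nS)[s])
        b_ub.append(-R[s, a])

res = linprog(c=alpha, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * nS, method="highs")
v = res.x

# The minimizer satisfies the optimality equations:
# v(s) = max_a { r(s,a) + lam * sum_j p(j|s,a) v(j) }
bellman = (R + lam * np.einsum("asj,j->sa", P, v)).max(axis=1)
print(np.allclose(v, bellman))  # expect True
```

The solver returns the optimal value function: minimizing the $\alpha$-weighted sum pushes each $v(s)$ down to the tightest constraint, which is exactly the optimality equation.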
Accordingly, the dual linear program would be:
Maximize: $\sum_{s \in S}\sum_{a \in A_{s}}r(s,a) \chi(s,a)$
Subject To:
$\forall j \in S: \sum_{a \in A_{j}}\chi(j,a) - \lambda\sum_{s \in S}\sum_{a \in A_{s}}p(j\mid s,a)\chi(s,a) = \alpha(j)$
$\forall s \in S, a \in A_{s}: \chi(s,a) \geq 0$
Intuitively, $\chi(s,a)$ can be thought of as the probability of being in state $s$ and performing action $a$, while taking into account the discount factor $\lambda$. In the following theorem $\chi(s,a)$ is defined more formally.

Theorem 6.8 (no proof)  
1.
For every $d \in \Pi^{MR}$, $s\in S$ and $a \in A_{s}$, let:

\begin{eqnarray*}\chi_{d}(s,a) = \sum_{j\in S}\alpha(j)
\sum_{n=1}^{\infty}\lambda^{n-1} Prob[x_{n}=s\bigwedge y_{n}=a\mid
x_{1}=j]
\end{eqnarray*}


where $x_{n}$ and $y_{n}$ denote the state and the action taken at time $n$ when following policy $d$, and $x_{1}$ is the initial state. Then $\chi_{d}(s,a)$ is a feasible solution of the dual linear maximization problem.
2.
Let $\chi(s,a)$ be a feasible solution of the dual linear problem. For every $s\in S$ and $a \in A_{s}$ define $d_{x}(s)$ such that:

\begin{eqnarray*}Prob[d_{x}(s)=a]=\frac{\chi(s,a)}{\sum_{\overline{a} \in
A_{s}}\chi(s,\overline{a})}
\end{eqnarray*}


and then $\chi_{d_{x}}(s,a)=\chi(s,a).$
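Part 1 of the theorem can be checked numerically. The sketch below (all MDP numbers and the randomized policy are invented for illustration) computes $\chi_{d}(s,a)$ for a fixed randomized policy $d$ via the discounted state-occupation measure, then verifies that it satisfies the dual equality constraints:

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP (numbers invented):
# P[a, s, j] = p(j | s, a); alpha = initial-state distribution.
lam = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
alpha = np.array([0.5, 0.5])
d = np.array([[0.4, 0.6],   # Prob[d(0) = a]
              [0.2, 0.8]])  # Prob[d(1) = a]

# Under d, the state transition matrix is P_d[s,j] = sum_a d(s,a) p(j|s,a).
# The discounted state occupation mu solves mu = alpha + lam * P_d^T mu,
# which is the closed form of the infinite sum in the theorem.
P_d = np.einsum("sa,asj->sj", d, P)
mu = np.linalg.solve(np.eye(2) - lam * P_d.T, alpha)
chi = d * mu[:, None]  # chi_d(s,a) = mu(s) * Prob[d(s) = a]

# Verify feasibility for the dual:
# sum_a chi(j,a) - lam * sum_{s,a} p(j|s,a) chi(s,a) = alpha(j)
lhs = chi.sum(axis=1) - lam * np.einsum("asj,sa->j", P, chi)
print(np.allclose(lhs, alpha))  # expect True
```

Summing the geometric series of discounted visitation probabilities reduces to solving one linear system, which is why $\chi_{d}$ is easy to compute for a fixed policy.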

Theorem 6.8 ensures that every solution of the dual linear program can be translated into a policy whose return equals the objective value of that solution. Thus, the optimal solution of the program can be translated into a policy with the maximal possible return. This policy is therefore an optimal policy.
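The policy-extraction step can be sketched as follows (the 2-state, 2-action MDP below is invented for illustration): solve the dual with `scipy.optimize.linprog` and normalize $\chi(s,\cdot)$ row by row, as in part 2 of the theorem:

```python
import numpy as np
from scipy.optimize import linprog

# A hypothetical 2-state, 2-action MDP (numbers invented):
# P[a, s, j] = p(j | s, a), R[s, a] = r(s, a).
lam = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 2.0],
              [0.0, 3.0]])
alpha = np.array([0.5, 0.5])
nS, nA = 2, 2

# Dual: maximize sum_{s,a} r(s,a) chi(s,a) subject to, for every j,
#   sum_a chi(j,a) - lam * sum_{s,a} p(j|s,a) chi(s,a) = alpha(j),
# with chi >= 0 (linprog's default bounds). Variables are flattened
# as chi[s*nA + a]; linprog minimizes, so we negate the rewards.
A_eq = np.zeros((nS, nS * nA))
for j in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[j, s * nA + a] = (j == s) - lam * P[a, s, j]
res = linprog(c=-R.flatten(), A_eq=A_eq, b_eq=alpha, method="highs")
chi = res.x.reshape(nS, nA)

# Theorem 6.8(2): Prob[d_x(s) = a] = chi(s,a) / sum_abar chi(s,abar).
policy = chi / chi.sum(axis=1, keepdims=True)
print(policy.argmax(axis=1))  # the action chosen in each state
```

At the dual optimum the mass of $\chi(s,\cdot)$ concentrates on the optimal actions, so the extracted policy is deterministic here; for these numbers it picks action 1 in both states.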


Yishay Mansour
1999-12-18