Markovian Policy

Next: The Return Function Up: Infinite Horizon Problems Previous: The Expected Discounted Sum

Markovian Policy

We will now show for every initial state, s, and a history dependent policy, $\pi$ , a Markovian policy $\pi^{'}$ such that the distribution on (X_t, Y_t) is equal for $\pi$ and $\pi^{'}$ .

Theorem 4.3 Let $\pi=(d_1, d_2, ...) \in \Pi^{HR}$ .
Then $\forall s \in S$ there exists a Markovian stochastic policy $\pi^{'}=(d_1^{'}, d_2^{'}, ...) \in \Pi^{MR}$ ,
that sutisfies $Prob_{\pi^{'}}[X_t=j,Y_t=a \vert X_1=s] = Prob_{\pi}[X_t=j,Y_t=a \vert X_1=s]$

Proof:For every $j \in S$ and $a \in A_j$ we define $\pi^{'}$ as follows:
$q_{d_t^{'}(j)}(a) = Prob_{\pi}[Y_t=a \vert X_t=j, X_1=s]$
We will first show that this definition results in the same distibution over the group of actions:

$\begin{eqnarray*}Prob_{\pi^{'}}[Y_t=a \vert X_t=j] & = & Prob_{\pi^{'}}[Y_t=a \vert X_t=j, X_1=s]\\ & = & Prob_{\pi}[Y_t=a \vert X_t=j, X_1=s] \end{eqnarray*}$

The first equality is derived from the fact that $\pi^{'}$ is Markovian.
The second equality is by definition.

It is left to be shown that the distribution over the group of states is equal under $\pi$ and $\pi^{'}$ , i.e.
$Prob_{\pi^{'}}[X_t=j \vert X_1=s] = Prob_{\pi}[X_t=j \vert X_1=s]$
We will prove this part by an induction on t. The idea behind the proof of this part is that if at a certain step we have an equal distribution over the group of states, and we are taking the same stochastic action, we will endup with same distribution over the group of states.
Basis: for $t = 1 \;\;\;\; X_t=s$ in $\pi$ and in $\pi^{'}$
Induction Step: We assume that there is an identity between the distribution over the group of states in $\pi$ and in $\pi^{'}$ until the time t-1

$\begin{eqnarray*}Prob_{\pi}[X_t=j \vert X_1=s] & = & \sum_{k \in S} \sum_{a \in ... ... X_1=s]p(j\vert k, a)\\ & = & Prob_{\pi^{'}}[X_t=j \vert X_1=s] \end{eqnarray*}$

Recalling that $Prob_{\pi^{'}}[X_t=j,Y_t=a \vert X_1=s] = Prob_{\pi^{'}}[X_t=j \vert X_1=s] \cdot Prob_{\pi^{'}}[Y_t=a \vert X_t=j, X_1=s]$
concludes this proof. $\Box$

The Return Function

Next: The Return Function Up: Infinite Horizon Problems Previous: The Expected Discounted Sum

Yishay Mansour
1999-11-18