
   
Convergence proof

The general idea is to write our algorithm in the form:

$Q \leftarrow (1-\alpha)Q + \alpha(HQ+w)$

where:

$(HQ)(s,a) = r(s,a)+\lambda\sum_{s'}P(s'\vert s,a)\max_{b}Q(s',b)$

We will show that H is a contraction mapping (operator). Using the max norm, $\vert\vert Q\vert\vert = \max_{s,a}\vert Q(s,a)\vert$, we have
$\vert\vert HQ_{1} - HQ_{2}\vert\vert = \vert\vert\lambda P[MQ_{1} - MQ_{2}]\vert\vert$,
where $(MQ_{1})(s)=\max_{a}Q_{1}(s,a)$. Note that $MQ_{1}=V_{1}$. Since P is a stochastic matrix and the max operator is non-expansive, we have
$\vert\vert HQ_{1} - HQ_{2}\vert\vert \leq \lambda\vert\vert V_{1} - V_{2}\vert\vert\leq \lambda\vert\vert Q_{1} - Q_{2}\vert\vert$.
Therefore H is a $\lambda$-contraction.
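Written out entry by entry (using only the max norm above and the fact that $P(\cdot\vert s,a)$ is a probability distribution), the bound is:

\begin{align*}
\vert(HQ_{1})(s,a)-(HQ_{2})(s,a)\vert
 &= \lambda\Big\vert\sum_{s'}P(s'\vert s,a)\big[\max_{b}Q_{1}(s',b)-\max_{b}Q_{2}(s',b)\big]\Big\vert\\
 &\leq \lambda\sum_{s'}P(s'\vert s,a)\max_{b}\vert Q_{1}(s',b)-Q_{2}(s',b)\vert\\
 &\leq \lambda\,\vert\vert Q_{1}-Q_{2}\vert\vert .
\end{align*}

Taking the maximum over all pairs (s,a) gives $\vert\vert HQ_{1}-HQ_{2}\vert\vert\leq\lambda\vert\vert Q_{1}-Q_{2}\vert\vert$.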
The term w is the "noise" in the sample, whose expectation is 0. Using the contraction property of H we will show convergence; the precise conditions are given in the following theorem.

Theorem 9.1 (A general theorem for Stochastic processes)  
Let: $r_{t+1}(i)=(1-\alpha_{t}(i))r_{t}(i)+\alpha_{t}(i)[(Hr_{t})(i)+w_{t}(i)]$

Given that:
1.
$E[w_{t}(i)\vert F_{t}]=0$
2.
$E[w_{t}^{2}(i)\vert F_{t}] \leq A+B\vert\vert r_{t}\vert\vert^{2}$ for some constants A and B,
3.
the step sizes $\alpha_{t}(i)$ satisfy $\sum_{t}\alpha_{t}(i)=\infty$ and $\sum_{t}\alpha_{t}^{2}(i)<\infty$,
4.
the mapping (operator) H is a contraction,
then $r_{t} \rightarrow r^{*}$ with probability 1, where $r^{*}$ is the fixed point of H, i.e. $Hr^{*}=r^{*}$.
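For example, the standard choice $\alpha_{t}(i)=\frac{1}{t}$ (for the t-th update of entry i) satisfies condition 3, since

$\sum_{t}\frac{1}{t}=\infty \qquad \mbox{and} \qquad \sum_{t}\frac{1}{t^{2}}<\infty ;$

this is exactly the step size used in Theorem 9.3 below.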

In the case of Q-Learning this means that $HQ^{*}=Q^{*}$, where $Q^{*}$ is the optimal value function
(H is the same operator as in Value Iteration). Thus, by the convergence of Q-Learning, we obtain the optimal value function, and hence an optimal policy.
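As an illustration (not part of the original notes), the following sketch writes one tabular Q-Learning step in exactly the form $Q \leftarrow (1-\alpha)Q+\alpha(HQ+w)$; the tiny random MDP, the variable names and the use of numpy are assumptions made only for this example. The visit-count step size $\alpha=1/n(s,a)$ satisfies condition 3 of Theorem 9.1.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, lam):
    # One tabular Q-Learning step, written as Q <- (1-alpha)Q + alpha(HQ + w).
    # The sample r + lam * max_b Q[s_next, b] is an unbiased estimate of
    # (HQ)(s,a) = r(s,a) + lam * sum_{s'} P(s'|s,a) max_b Q(s',b);
    # its deviation from (HQ)(s,a) is the zero-mean noise term w.
    sample = r + lam * np.max(Q[s_next])
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * sample

# A tiny random MDP (2 states, 2 actions), used only for illustration.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))   # P[s, a] is a distribution over next states
R = rng.uniform(0.0, 1.0, size=(2, 2))       # bounded non-negative rewards r(s, a)
lam = 0.9                                    # discount factor (lambda in the notes)

Q = np.zeros((2, 2))                         # Q_0(s, a) = 0, as in Theorem 9.2
counts = np.zeros((2, 2))
s = 0
for t in range(100000):
    a = rng.integers(2)                      # random exploration: every (s, a) is visited infinitely often
    counts[s, a] += 1
    alpha = 1.0 / counts[s, a]               # sum alpha = infinity, sum alpha^2 < infinity
    s_next = rng.choice(2, p=P[s, a])
    q_learning_update(Q, s, a, R[s, a], s_next, alpha, lam)
    s = s_next

print(Q)                                     # approximates Q*, the fixed point of HQ* = Q*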

Theorem 9.2   If all r(s,a) are non-negative and bounded, and $Q_{0}(s,a)=0$, then
$Q^{(n)} \rightarrow Q^{*}$

For the theorem to hold we assume that each pair (s,a) is sampled an infinite number of times,
so that we can consider each such sequence of updates separately.

How do we ensure that all pairs (s,a) are visited?
1.
If we follow the greedy policy with respect to $Q^{(n)}$, we may never try some pairs (s,a),
even if we reach every state $s \in S$.
2.
Follow from each state s the $\varepsilon$-greedy policy: with probability $1-\varepsilon$ we choose the greedy action from s,
and with probability $\varepsilon$ we choose a random action from s.
3.
Choose the action at step n according to the Boltzmann (softmax) distribution:

$\Pi^{n}(s,a)= \frac{\exp(Q(s,a)/T)}{\sum_{b}\exp(Q(s,b)/T)}$

The temperature T determines whether we act greedily ($T \rightarrow 0$) or randomly ($T \rightarrow \infty$). We can view this as partitioning the time into phases of exploration (random) and exploitation (greedy).

With all three methods we can decide how much to exploit the knowledge we already have versus how much to gather more information about the MDP; a short sketch of selection rules 2 and 3 appears below.
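A minimal sketch of the two selection rules (again only an illustration; the function names and the numpy usage are assumptions made for this example):

import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, s, epsilon):
    # With probability 1 - epsilon take the greedy action, otherwise a uniformly random one.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def boltzmann(Q, s, T):
    # Pi(s, a) proportional to exp(Q(s, a) / T):
    # T -> 0 approaches the greedy choice, T -> infinity approaches a uniform random choice.
    prefs = Q[s] / T
    prefs = prefs - prefs.max()              # subtract the max for numerical stability
    p = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(Q.shape[1], p=p))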

The next theorem is a weaker version of the main theorem.

Theorem 9.3   Let $r_{t+1}=(1-\frac{1}{t})r_{t}+\frac{1}{t}(Hr_{t}+w_{t})$
Assume that:
1.
$E[w_{t}]=0$ and $\vert w_{t}\vert\leq 1$
2.
H is a contraction mapping
3.
$\vert r_{t}\vert \leq D_{0}$
Then $r_{t} \rightarrow r^{*}$, where $Hr^{*} = r^{*}$.

Proof: Since H is a contraction,
$\vert\vert Hr-Hr^{*}\vert\vert\leq \beta \vert\vert r-r^{*}\vert\vert$ for some $\beta < 1$.
Without loss of generality, assume that $r^{*}=0$;
therefore $\vert\vert Hr\vert\vert\leq \beta \vert\vert r\vert\vert$.

Define $D_{k+1} = (\beta+2\varepsilon)D_{k}$, where $\varepsilon>0$ is chosen so that $\beta + 2\varepsilon<1$.
We show by induction that for every k there exists a time $t_{k}$ such that $\vert\vert r_{t}\vert\vert \leq D_{k}$ for every $t \geq t_{k}$; the base case k=0 holds by assumption 3.

We bound $r_{t}$ for $t>t_{k}$. Since $\vert\vert r_{t}\vert\vert \leq D_{k}$ (the induction hypothesis), we have $\vert\vert Hr_{t}\vert\vert \leq \beta D_{k}$.
Unrolling the recursion,
$r_{t+1}=\frac{t-1}{t}r_{t}+\frac{1}{t}(Hr_{t}+w_{t})=\frac{t-l-1}{t}r_{t-l}+\frac{1}{t}\sum_{i=t-l}^{t}Hr_{i}+\frac{1}{t}\sum_{i=t-l}^{t}w_{i}$.
Choose $l=t-t_{k}$; then for every $i$ in the sums $\vert\vert r_{i}\vert\vert\leq D_{k}$ and $\vert\vert Hr_{i}\vert\vert\leq \beta D_{k}$.
Since $E[w_{i}]=0$ and $\vert w_{i}\vert\leq 1$, we have $\frac{1}{t}\sum_{i=t-l}^{t}w_{i}\rightarrow 0$. Therefore $\vert\vert\frac{1}{t}\sum_{i=t-l}^{t}w_{i}\vert\vert \leq \varepsilon D_{k}$ for all but a finite number of t.

We continue:
$\vert\vert r_{t+1}\vert\vert \leq \frac{t_{k}-1}{t}\vert\vert r_{t_{k}}\vert\vert+\frac{1}{t}\sum_{i=t_{k}}^{t}\vert\vert Hr_{i}\vert\vert+\vert\vert\frac{1}{t}\sum_{i=t_{k}}^{t}w_{i}\vert\vert \leq \frac{t_{k}}{t}D_{k}+\frac{t-t_{k}+1}{t}\beta D_{k}+\varepsilon D_{k}$

For $t \geq t_{k}/\varepsilon$ we get:
$\vert\vert r_{t+1}\vert\vert \leq \varepsilon D_{k} + \beta D_{k}+\varepsilon D_{k} = (\beta+2\varepsilon)D_{k}=D_{k+1}$

Therefore there exists a time $t_{k+1}$ such that $\vert\vert r_{t}\vert\vert\leq D_{k+1}$ for every $t\geq t_{k+1}$, which completes the induction.
Since $D_{k}\rightarrow 0$, we get that $\vert\vert r_{t}\vert\vert \rightarrow 0 = r^{*}$ as $t\rightarrow \infty$. $\Box$
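As a sanity check of Theorem 9.3 (not from the original notes), here is a one-dimensional simulation of the iteration with a hypothetical contraction $H(r)=\frac{1}{2}r+1$, whose fixed point is $r^{*}=2$, and bounded zero-mean noise:

import numpy as np

rng = np.random.default_rng(1)

def H(r):
    # A beta-contraction with beta = 1/2 and fixed point r* = 2 (since H(2) = 2).
    return 0.5 * r + 1.0

r = 0.0
for t in range(1, 1000000):
    w = rng.uniform(-1.0, 1.0)               # E[w_t] = 0 and |w_t| <= 1
    r = (1.0 - 1.0 / t) * r + (1.0 / t) * (H(r) + w)

print(r)                                     # converges to the fixed point r* = 2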


Yishay Mansour
2000-01-07