
   
Q-learning

Let us consider the Value Iteration (VI) algorithm from Lecture 6. It is defined by a non-linear operator L: in every iteration of the algorithm we apply L, i.e. $V_{n+1}=LV_{n}$, and explicitly:

$V_{n+1}(s) = \max_{a \in A_{s}}\{r(s,a) + \lambda\sum_{s^{'}\in S} P(s^{'}\vert s,a)V_{n}(s^{'})\}$.

Let us rewrite this equation somewhat. We define a new function Q in terms of VI:

$Q^{n+1}(s,a) = r(s,a) + \lambda\sum_{s^{'}\in S}P(s^{'}\vert s,a)V_{n}(s^{'})$.

Now the iteration of VI becomes: $V_{n}(s) = \max_{a \in A_{s}}\{Q^{n}(s,a)\}$. Expressed in terms of the Q function only, we have:

$Q^{n+1}(s,a) = r(s,a) + \lambda\sum_{s^{'}\in S}P(s^{'}\vert s,a)\max_{b \in A_{s^{'}}}\{Q^{n}(s^{'},b)\}$.
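As a concrete illustration, here is a minimal sketch in Python of this Q-form of value iteration for a finite MDP with a known model. The array names P and r, and the use of gamma for the discount factor $\lambda$, are conventions chosen for this sketch and are not part of the lecture.

\begin{verbatim}
import numpy as np

def q_value_iteration(P, r, gamma, n_iters=1000):
    """Iterate Q^{n+1}(s,a) = r(s,a) + gamma * sum_{s'} P(s'|s,a) max_b Q^n(s',b).

    P: array of shape (S, A, S), P[s, a, s2] = P(s2 | s, a)
    r: array of shape (S, A),    r[s, a]     = r(s, a)
    gamma: discount factor (lambda in the notes)
    """
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        V = Q.max(axis=1)          # V_n(s') = max_b Q^n(s', b)
        Q = r + gamma * P.dot(V)   # expectation over the next state s'
    return Q
\end{verbatim}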

We now write the iteration in $\alpha$-notation. (In Lecture 7 we saw that such an averaged iteration converges to the correct value.)

$Q^{n+1}(s,a) = (1-\alpha)Q^{n}(s,a) + \alpha[ r(s,a) + \lambda\sum_{s^{'}\in S}P(s^{'}\vert s,a)\max_{b \in A_{s^{'}}}\{Q^{n}(s^{'},b)\}]$

Up to this point the iterations are equivalent to VI. The idea of Q-learning is that, instead of taking the expectation of the value of the next state, we take a sample of the next state: we assume that we are in state s, we take action a, and the next state s' is distributed according to P(s'|s,a). Finally we get

$Q^{n+1}(s,a) = (1-\alpha)Q^{n}(s,a) + \alpha[ r(s,a) + \lambda\max_{b \in A_{s^{'}}}\{Q^{n}(s^{'},b)\}]$
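This sampled iteration is a single table update. Below is a minimal sketch, assuming Q is stored as a (state, action) array and using alpha for the step size and gamma for $\lambda$ (names chosen for the sketch only).

\begin{verbatim}
def q_learning_update(Q, s, a, reward, s_next, alpha, gamma):
    """One sampled backup:
    Q(s,a) <- (1 - alpha) Q(s,a) + alpha [ r + gamma * max_b Q(s',b) ].
    Only the observed next state s_next is used, not the full expectation.
    """
    target = reward + gamma * Q[s_next].max()
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q
\end{verbatim}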


  
Figure 9.1: Algorithm for Q-LEARNING
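A minimal sketch of the full tabular Q-learning loop with $\epsilon$-greedy exploration is given below. The environment interface (reset/step), the exploration rule, and all parameter names are assumptions made for illustration and are not taken from the algorithm box above.

\begin{verbatim}
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    env is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice of an action a in A_s
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, reward, done = env.step(a)
            # sampled backup towards r + gamma * max_b Q(s', b)
            Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q
\end{verbatim}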



 
