\documentstyle[12pt,psfig]{book}
\setlength{\evensidemargin}{.0 in} \addtolength{\topmargin}{-1in}
\setlength{\textwidth}{6.5in} \setlength{\textheight}{8in}
\newcommand{\remove}[1]{}
\newcommand{\Do}{{\small\bf do}\ }
\newcommand{\Return}{{\small\bf return\ }}
\newcommand{\Proc}[1]{#1\+}
\newcommand{\Returns}{{\small\bf returns}}
\newcommand{\Procbegin}{{\small\bf begin}}
\newcommand{\If}{{\small\bf if}\ \=\+}
\newcommand{\Then}{{\small\bf then}\ \=\+}
\newcommand{\Else}{\<{\small\bf else}\ \>}
\newcommand{\Elseif}{\<{\small\bf elseif}\ }
\newcommand{\Endif}{\<{\small\bf end\ if\ }\-\-}
\newcommand{\Endproc}[1]{{\small\bf end} #1\-}
\newcommand{\For}{{\small\bf for}\ \=\+}
\newcommand{\Endfor}{{\small\bf end\ for}\ \-}
\newcommand{\Loop}{{\bf loop}\ \= \+}
\newcommand{\Endloop}{{\bf end\ loop}\ \-}
\newenvironment{program}{
\begin{minipage}{\textwidth}
\begin{tabbing}
\ \ \ \ \=\kill
}{
\end{tabbing}
\end{minipage}
}
\def\blankline{\hbox{}}
\newcommand{\lecture}[7]{
\pagestyle{headings}
\thispagestyle{plain}
\newpage
\setcounter{chapter}{#1}
\setcounter{page}{#2}
% \set\thechapter{#3}
\noindent
\begin{center}
\framebox{
\vbox{
\hbox to 6.28in { {\bf Computational Learning Theory
\hfill Fall Semester, 1999/00} }
\vspace{4mm}
\hbox to 6.28in { {\Large \hfill Lecture #1: #3 \hfill} }
\vspace{2mm}
\hbox to 6.28in { {\it Lecturer: #4 \hfill Scribe: #5, #6} }
}
}
\end{center}
\markboth{Lecture #1: #3}{Lecture #1: #3}
\vspace*{4mm}
}
\newcommand{\topic}[2]{\section{#1} \index{#2} \markright{#1}}
\newcommand{\subtopic}[2]{\subsection{#1} \index{#2}}
\newcommand{\subsubtopic}[2]{\subsubsection{#1} \index{#2}}
\renewcommand{\cite}[1]{[#1]}
\newcommand{\bi}{\begin{itemize}}
\newcommand{\ei}{\end{itemize}}
\newcommand{\be}{\begin{enumerate}}
\newcommand{\ee}{\end{enumerate}}
\newcommand{\blank}{\vspace{1ex}} % generates a blank line in the output
\newtheorem{theorem}{Theorem}[chapter]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{corollary}[theorem]{Corollary}
\newcommand{\qed}{\hfill $\Box$}
\newenvironment{proof}{\par{\bf Proof:}}{\qed \par}
\newenvironment{dfn}{{\vspace*{1ex} \noindent \bf Definition }}{\vspace*{1ex}}
\newcommand{\bigdef}[2]{\index{#1}\begin{dfn} {\rm #2} \end{dfn}}
\newcommand{\smalldef}[1]{\index{#1} {\em #1}}
% **** YOUR NOTES GO HERE ****************************************************
\begin{document}
\lecture{8}{1}{December 16}{Yishay Mansour}{Alexander
Rusinov}{Moti Alperovitch}
\topic{Evaluating the Return of a Policy using MC and TD(0)}{1}
\subtopic{Another Way to Look at the MC Algorithm}{1}
In Lecture 7 we discussed the Monte-Carlo (MC) method for
evaluating the return of a policy.
\\ This method performs a number of runs and uses the average
return to estimate the return of the policy.\\
Another way to express the estimate is the following:\\
$V_{n+1}(s) = V_{n}(s) + \alpha(R_{n}(s) - V_{n}(s))$\\
where $R_{n}(s)$ is the total reward of the $n$-th run starting
from the first visit to $s$ (given that $s$ was visited in this
run). Note that $E[R_{n}(s)] = V^{\pi}(s)$.\\
We can rewrite this formula as follows:\\
$V_{n+1}(s) = (1-\alpha)V_{n}(s) + \alpha[V^{\pi}_{n}(s) +
(R_{n}(s) - V^{\pi}_{n}(s))]$\\
i.e., an update of the form $V_{n+1} = (1-\alpha)V_{n} +
\alpha[HV_{n} + W_{n}]$, where $H$ is some (possibly nonlinear)
operator with $HV_{n} = V^{\pi}_{n+1}$, and $W_{n} = R_{n} -
V^{\pi}_{n}$ is a ``noise'' term with $E(W_{n}) = 0$.
\\
Recall the operator $L_{\pi}\vec{V} = \vec{R_{\pi}} + \lambda
P_{\pi}\vec{V}$, which was introduced to compute the return of a
policy. We have already shown that $L_{\pi}$ is a contracting
operator.\\
\subtopic{Temporal Differences and the TD(0) Algorithm}{1}
Based on the operator above we can write down the following algorithm:\\
\\ $V_{n+1}(s) \leftarrow (1-\alpha_{n}) V_{n}(s)+
\alpha_{n}[r_{\pi}(s)+ \lambda E_{s}[V_{n}(s')]]$\\
Instead of computing the expectation $E_{s}$ we sample the next
state $s'$ and derive\\
\\ $V_{n+1}(s) \leftarrow (1-\alpha_{n}) V_{n}(s)+
\alpha_{n}[r_{\pi}(s)+ \lambda V_{n}(s')]$, or equivalently,
\\ $V_{n+1}(s) \leftarrow V_{n}(s)+ \alpha_{n}[r_{\pi}(s)+ \lambda
V_{n}(s') - V_{n}(s)]$\\
We call $r_{\pi}(s)+ \lambda V_{n}(s') - V_{n}(s)$ the
\emph{temporal difference} (TD), since it is the difference between
the reward observed in the current step, $r_{\pi}(s)+ \lambda
V_{n}(s')$, and the current estimate $V_{n}(s)$.\\
This algorithm is called TD(0).\\
Note that at the fixed point $V^{\pi}$ we have $E[r_{\pi}(s)+
\lambda V^{\pi}(s')] = V^{\pi}(s)$, so the expected temporal
difference is zero.\\
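As a concrete illustration, a single TD(0) update can be sketched in code. This is a minimal sketch: the value table \texttt{V}, the state names, and the parameter values below are hypothetical, not from the lecture.

```python
# A single TD(0) update: after observing, in state s, the immediate
# reward r and the next state s_next, move V(s) toward the
# temporal-difference target r + lam * V(s_next).
# lam is the discount factor, alpha the step size.
def td0_update(V, s, r, s_next, alpha, lam):
    V[s] += alpha * (r + lam * V[s_next] - V[s])

V = {"s": 0.0, "t": 1.0}  # hypothetical two-state value table
td0_update(V, "s", r=1.0, s_next="t", alpha=0.5, lam=0.9)
print(V["s"])  # 0.5 * (1 + 0.9*1 - 0) = 0.95
```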
The following theorem is used to show the convergence of the
algorithm:
\begin{theorem}\label{L1:CONVERGENCE}
Consider the iterative algorithm $r_{t+1}(i) \leftarrow
(1-\alpha_{t}(i)) r_{t}(i)+ \alpha_{t}(i)[(Hr_{t})(i)+ w_{t}(i)]$
such that
\\1) $E(w_{t}(i)|F_{t}) = 0$ and $E(w^2_{t}(i)|F_{t}) \leq A +
B\|r_{t}\|^2$ for some constants $A$ and $B$;
\\2) the step size $\alpha_{t}$ satisfies $\sum_{t}\alpha_{t} = \infty$
and $\sum_{t}\alpha^2_{t} < \infty$;
\\3) $H$ is a contracting mapping.
\\ Then $r_{t}$ converges to $r^{*}$ with probability one, where
$r^{*} = H r^{*}$.
\\
\end{theorem}
Recall that convergence of a sequence $A_{n}$ to $A$ with
probability one means that $\lim_{N\rightarrow\infty}\sup_{n
\geq N}|A_{n} - A| = 0$ with probability one. The above theorem
implies that TD(0) converges with probability one to the correct
value function (as does the MC algorithm).\\
\subtopic {Online Versus Off-Line Updates}{1}
The basic advantage of TD(0) is the possibility of performing
online updates based on the information received up to the current
point: in TD(0), all we need in order to perform an update is the
next state.\\
Here is an example of a TD(0)-like process: an employee estimates
the time he needs to get home from work.
\begin{enumerate}
\item State: Leaving for home; Elapsed time: 0;
Estimated time to goal: 30; Total estimation: 30
\item State: Got into the car, it's raining; Elapsed time: 5;
Estimated time to goal: 35; Total estimation: 40
\item State: Getting off the highway; Elapsed time: 20;
Estimated time to goal: 15; Total estimation: 35
\item State: There is a big truck ahead; Elapsed time: 30;
Estimated time to goal: 10; Total estimation: 40
\item State: Got to the right street; Elapsed time: 40;
Estimated time to goal: 3; Total estimation: 43
\item State: Got home; Elapsed time: 43;
Estimated time to goal: 0; Total estimation: 43 \\
\end{enumerate}
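The total estimations in the list above are just elapsed time plus the remaining estimate, and the quantity a TD(0)-style learner can use online is the change in this total. A short sketch (the numbers are copied from the list):

```python
# The commute example as (elapsed time, estimated time to goal) pairs.
steps = [(0, 30), (5, 35), (20, 15), (30, 10), (40, 3), (43, 0)]

# Total estimation at each step = elapsed time + estimated time to goal.
totals = [elapsed + togo for elapsed, togo in steps]
print(totals)  # [30, 40, 35, 40, 43, 43]

# A TD-style learner can react to the change in the total estimation
# at every step; an MC-style learner sees only the final total (43)
# once the run ends.
td_corrections = [b - a for a, b in zip(totals, totals[1:])]
print(td_corrections)  # [10, -5, 5, 3, 0]
```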
An MC-like algorithm must finish the run before making any
updates; in TD(0) the updates are made online (hopefully in the
right direction), while in MC all updates are made only after the
run is finished.\\
\subtopic {Differences between TD(0) and MC }{1}
If we run TD(0) and MC for long enough (and give them enough
data), both algorithms converge to the correct value function.
However, on a finite input the algorithms can give different
results.\\
\\ To understand the basic difference between MC and TD(0) we
study the following example. Assume that we have performed a
series of runs and got the following results:
\begin{enumerate}
\item (A,0) (B,0)
\item (B,1)
\item (B,1)
\item (B,1)
\item (B,1)
\item (B,1)
\item (B,1)
\item (B,0) \\
\end{enumerate}
We can create an infinite sample from such a finite sample as
follows: each time we need to sample a run, we choose one of the
eight runs uniformly at random.\\
We want to estimate $V(A)$ and $V(B)$ using this finite sample.\\
For $V(B)$ we have 8 runs, of which 6 give return 1 and 2 give
return 0, so our estimate for $V(B)$ is $\frac{3}{4}$.\\ How
should we estimate $V(A)$? There are two reasonable options.\\
\begin{enumerate}
\item We have one run visiting state A, with return 0, so the
estimate is $V(A) = 0$.
\item We have one run visiting state A, and this run moves
from state A to state B, so we can assume that we always move from
state A to state B. This implies the estimate $V(A) = V(B) =
\frac{3}{4}$.
\\
\end{enumerate}
Clearly the first approach corresponds to MC. We now show that
TD(0) implements the second approach; that is, TD(0) evaluation is
in fact done using the Markovian model derived from the sample: for
any state $s$ and action $a$, the next-state distribution is
identical to the one observed in the sample.\\
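The two estimates can be checked numerically. The sketch below (hypothetical helper names, discount $\lambda = 1$) computes the first-visit MC estimates directly and runs TD(0) on runs sampled at random from the eight runs above; with decaying step sizes TD(0) approaches the empirical-model value $V(A) = V(B) = \frac{3}{4}$, while MC gives $V(A) = 0$.

```python
import random

# The eight runs from the text: each run is a list of (state, reward)
# pairs, and ends in a terminal state.
runs = [[("A", 0.0), ("B", 0.0)]] + [[("B", 1.0)]] * 6 + [[("B", 0.0)]]

# First-visit Monte-Carlo estimate (discount lambda = 1): average,
# over the runs visiting `state`, of the total reward from the first
# visit to the end of the run.
def mc_estimate(runs, state):
    returns = []
    for run in runs:
        states = [s for s, _ in run]
        if state in states:
            i = states.index(state)
            returns.append(sum(r for _, r in run[i:]))
    return sum(returns) / len(returns)

# TD(0) with per-state step sizes alpha_n = 1/n, each episode
# sampling one of the eight runs at random (the "infinite sample"
# described above).
def td0_estimate(runs, episodes=50000, seed=0):
    rng = random.Random(seed)
    V = {"A": 0.0, "B": 0.0}
    count = {"A": 0, "B": 0}
    for _ in range(episodes):
        run = rng.choice(runs)
        for i, (s, r) in enumerate(run):
            count[s] += 1
            v_next = V[run[i + 1][0]] if i + 1 < len(run) else 0.0
            V[s] += (r + v_next - V[s]) / count[s]  # TD(0) update
    return V

print(mc_estimate(runs, "A"), mc_estimate(runs, "B"))  # 0.0 0.75
print(td0_estimate(runs))  # both V(A) and V(B) approach 0.75
```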
\begin{figure}
\begin{picture}(400,150)(0,25) % begins picture environment
\put(100,80){\circle{25}} \put(100,80){\makebox(0,0)[c]{A}}
\put(200,80){\circle{25}} \put(200,80){\makebox(0,0)[c]{B}}
\put(300,120){\circle{25}} \put(300,120){\makebox(0,0)[c]{C}}
\put(300,40){\circle{25}} \put(300,40){\makebox(0,0)[c]{D}}
\put(112.5,80){\line(1,0){75}} \put(212.5,80){\line(2,1){75}}
\put(212.5,80){\line(2,-1){75}}
\put(115,95){\makebox(0,0)[bl]{ $\ p = 1, r = 0$\ }}
\put(203,35){\makebox(0,0)[bl]{ $\ p = \frac{3}{4}, r = 1$\ }}
\put(203,120){\makebox(0,0)[bl]{ $\ p = \frac{1}{4}, r = 0$\ }}
\put(272,40){\makebox(0,0)[bl]{ $\ \triangle$\ }}
\put(272,112){\makebox(0,0)[bl]{ $\ \nabla$\ }}
\put(172,76){\makebox(0,0)[bl]{ $\ >$\ }}
\end{picture}
\caption{Empirical model.} % the figure caption (numbered by latex)
\label{L2:mypicture}
\end{figure}
\begin{claim}
Given a finite sample $T_{1},\ldots,T_{l}$, if we run TD(0) until
it converges (each time sampling one of the runs
$T_{1},\ldots,T_{l}$ at random), then the estimate
$\widehat{V}^{\pi}$ equals the return of policy $\pi$ in the
empirical model for $T_{1},\ldots,T_{l}$.
\end{claim}
\begin{dfn}
Given runs $T_{1},\ldots,T_{l}$, the empirical model is defined as
follows: $\forall s \in S, a \in A$:
$\widehat{p}(s'|s,a) = \frac{\#(s,a,s')}{\#(s,a)}$ and
$\widehat{r}(s,a) = Avg(r(s,a))$, where $\#(s,a)$ is the
number of times in $T_{1},\ldots,T_{l}$ that we performed action
$a$ in state $s$, and $\#(s,a,s')$ is the number of times that
after performing action $a$ in state $s$ we reached state $s'$.
\end{dfn}
$\mathbf{Explanation}$: It is easy to see from the definition
of TD(0) that the updates depend only on successive pairs of
states:\\
$V_{n+1}(s) \leftarrow V_{n}(s) + \alpha_{n} [\widehat{r} +
\lambda V_{n}(s') - V_{n}(s)]$\\
The number of updates with next state $s'$ is distributed
according to $\widehat{p}$, and so on average:\\
$V_{n+1}(s) \leftarrow V_{n}(s) + \alpha_{n} [\widehat{r} +
\lambda E_{s' \sim \widehat{p}} [V_{n}(s')] - V_{n}(s)]$\\
These are exactly the equations for the model with $\widehat{p}$
and $\widehat{r}$, i.e., the empirical model.\\
$\mathbf{Conclusion}$: The conceptual difference between MC
and TD(0) is the following: MC uses only the total return from
each state to the end of the run (available once the run ends),
while TD(0) assumes that there is an underlying Markovian model
and makes use of this assumption.\\
\topic{The TD($\gamma$) Algorithm}{1}
\subtopic{TD($\gamma$) as a Generalization of the MC and TD(0)
Methods}{1}
MC uses the overall return of the run to perform an update, i.e.,
$R_{t}=\sum_{i=0}^{T}\lambda^{i}r_{t+i}$.\\ TD(0) uses only the
immediate reward, and its update target has the form\\
$R_{t}^{(1)}=r_{t} + \lambda V_{t}(s_{t+1})$.\\
We can think about other options between those two extremes. For
example, we can use two immediate reward values, i.e.,
$R_{t}^{(2)}=r_{t} + \lambda r_{t+1} + \lambda^{2} V_{t}(s_{t+2})$.\\
Generally, if $n$ reward values are used, we have
$R_{t}^{(n)}=r_{t} + \lambda r_{t+1} + \ldots + \lambda^{n-1} r_{t+n-1}
+ \lambda^{n} V_{t}(s_{t+n})$.\\ Note: if $n$ exceeds the
number of steps to the end of the run, we set all the rewards
after the end of the run to 0.\\
The update is defined as $\Delta V_{t}(s_{t}) =
\alpha[R_{t}^{(n)}-V_{t}(s_{t})]$.\\
The update can be performed:
\begin{enumerate}
\item online: $V_{t+1}(s_{t}) \leftarrow V_{t}(s_{t}) + \Delta
V_{t}(s_{t})$
\item off-line: $V(s) \leftarrow V(s) + \sum_{t}\Delta V_{t}(s)$
\\
\end{enumerate}
It is easy to see that the operator $R_{t}^{(n)}$ is
$\lambda^{n}$-contracting:\\ $\|R_{t}^{(n)}(V^{(1)}) -
R_{t}^{(n)}(V^{(2)})\| = \lambda^{n}\|V^{(1)} - V^{(2)}\|$\\
So we expect the error to shrink as we iterate.\\
The central remaining question is the choice of $n$. One way to
avoid making it is to use a weighted average over all possible
values.\\
The method TD($\gamma$) places a geometric distribution with
parameter $\gamma$ over the $R_{t}^{(n)}$,\\ i.e., the weight of
$R_{t}^{(n)}$ is $(1-\gamma) \gamma^{n-1}$.\\
We define $R_{t}^{\gamma}$ to be $R_{t}^{\gamma} =
(1-\gamma)\sum_{n=1}^{\infty}\gamma^{n-1}R_{t}^{(n)}$.\\ \\ For a
finite run with $T$ remaining steps this expression can be
rewritten as\\
$R_{t}^{\gamma} = (1-\gamma)\sum_{n=1}^{T-1}\gamma^{n-1}R_{t}^{(n)}
+ \gamma^{T-1}R_{t}$\\
since every $R_{t}^{(n)}$ with $n \geq T$ equals the full return
$R_{t}$.\\
\begin{enumerate}
\item If $\gamma=0$ then the weight of $R_{t}^{(1)}$ is 1 and the
weight of every other $R_{t}^{(n)}$ is 0. This reduces to TD(0),
which we introduced before.
\item For $\gamma = 1$ we get $R_{t}^{\gamma} = R_{t}$
(the total reward). \\
\end{enumerate}
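These definitions can be sketched numerically. In the sketch below the reward and value lists are hypothetical, \texttt{lam} is the discount $\lambda$, and the finite-run form with tail weight $\gamma^{T-1}$ (under which the weights sum to 1) is assumed.

```python
# n-step return R_t^{(n)}: n discounted rewards, then the current
# value estimate of the state reached. `rewards` holds r_t, r_{t+1},
# ... up to the end of the run; `values` holds V(s_{t+1}), V(s_{t+2}),
# ... (the terminal state has value 0).
def n_step_return(rewards, values, n, lam):
    n = min(n, len(rewards))  # beyond the run end all rewards are 0
    ret = sum(lam ** i * r for i, r in enumerate(rewards[:n]))
    v = values[n - 1] if n - 1 < len(values) else 0.0  # V(terminal) = 0
    return ret + lam ** n * v

# Weighted return R_t^gamma for a finite run with T remaining steps:
# weight (1-gamma) * gamma^(n-1) on R_t^{(n)} for n < T, and the
# remaining weight gamma^(T-1) on the full return R_t = R_t^{(T)}.
def gamma_return(rewards, values, gamma, lam):
    T = len(rewards)
    tail = sum((1 - gamma) * gamma ** (n - 1)
               * n_step_return(rewards, values, n, lam)
               for n in range(1, T))
    return tail + gamma ** (T - 1) * n_step_return(rewards, values, T, lam)
```

For `gamma=0` only the $n=1$ term survives, recovering the TD(0) target $R_{t}^{(1)}$; for `gamma=1` only the full Monte-Carlo return $R_{t}$ remains.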
\subtopic {Forward Updates Versus Backward Updates}{1}
In the definitions of TD($\gamma$) so far we looked at updates
occurring forward in time (looking into the ``future''), as in
the MC method. We want to define a scheme for online updates, and
so we need a way to view the updates in the backward direction
(the ``past''). We would like the updates at the end of the run to
be the same for both schemes.
\\
To implement such a method, we maintain for every state an
additional variable, called the eligibility trace, which holds the
weight of the state in the run. At every step we multiply all the
traces by $\gamma\lambda$, and add 1 to the trace of the current
state. Formally,\\
$e_{t}(s) = \gamma\lambda e_{t-1}(s)$ if $s\neq s_{t}$
\\
$e_{t}(s) = \gamma\lambda e_{t-1}(s)+1$ if $s = s_{t}$
\\
Every time we visit a state its weight goes up, and the weights of
the states not visited go down. Let us define
$\delta_{t} = r_{t} + \lambda V_{t}(s_{t+1}) - V_{t}(s_{t})$. \\
The update is then $\Delta V_{t}(s) = \alpha \delta_{t}
e_{t}(s)$.\\
\subsubtopic{Algorithm}{1}
The algorithm is as follows:
\begin{enumerate}
\item Initialize $V(s)$ arbitrarily and set $e(s) = 0$ for all
$s \in S$.
\item For every run: set the initial state $s_{0}$.
\item Repeat for every step of the run, until the final state is
reached: \\
$\cdot$ Choose an action according to the policy $\pi$: $a
\leftarrow \pi (s)$ \\
$\cdot$ Perform action $a$; receive the next state $s'$
and the immediate reward $r_{t}$ \\
$\cdot$ $\delta_{t} \leftarrow r_{t} + \lambda V_{t}(s') -
V_{t}(s)$
\\
$\cdot$ $e_{t+1}(s) \leftarrow \gamma\lambda e_{t}(s)+1$
\\
$\cdot$ For every $f \in S$, $f \neq s$: $e_{t+1}(f) \leftarrow
\gamma\lambda e_{t}(f)$; then for every $f \in S$: $V_{t+1}(f)
\leftarrow V_{t}(f)+ \alpha\delta_{t} e_{t+1}(f)$
\end{enumerate}
Using this scheme, we update previously visited states according
to the new immediate reward and the state we have reached. For
more distant states the weight of the update is smaller, and for
closer states the weight is larger. \\
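The per-episode loop above can be sketched in code. This is a sketch under the document's convention: \texttt{lam} is the discount $\lambda$ and \texttt{gamma} is the trace parameter; the episode format is a hypothetical choice.

```python
# One episode of the eligibility-trace ("backward") scheme. `episode`
# is a list of (state, reward, next_state) transitions, with
# next_state = None at the terminal step; V maps states to values.
def td_gamma_episode(V, episode, alpha, lam, gamma):
    e = {s: 0.0 for s in V}  # eligibility traces, initially 0
    for s, r, s_next in episode:
        v_next = V[s_next] if s_next is not None else 0.0
        delta = r + lam * v_next - V[s]  # temporal difference
        for f in e:                      # decay every trace ...
            e[f] *= gamma * lam
        e[s] += 1.0                      # ... and bump the current state
        for f in V:                      # update all states by trace weight
            V[f] += alpha * delta * e[f]
    return V
```

With `gamma=0` only the current state is updated, recovering TD(0); with `gamma=1` every previously visited state is updated with weight $\lambda^{t-i_j}$.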
Let us examine different values of $\gamma$. \\
$\cdot$ For $\gamma = 0$: $e(s) = 1$ iff $s = s_{t}$, and
$e(s) = 0$ otherwise. \\ So we update only the current state
according to the received reward, and we once again have the
TD(0) scheme.\\
$\cdot$ For $\gamma = 1$: $e(s)$ is the contribution of the
current step to state $s$, i.e., if $s$ is visited in the run at
steps $i_{1},\ldots,i_{l}$ then $e_{t}(s) = \sum_{j=1}^{l}
\lambda^{t-i_{j}}$, \\ where $\lambda^{t-i_{j}}$ is the
contribution of the current step to the visit at step $i_{j}$. \\
Although TD(1) resembles MC, there is a minor difference: MC
waits for the end of the run to perform its updates, while TD(1)
performs updates online. By the end of the run both methods give
the same value function, but the values computed by TD(1) will
hopefully be more accurate during the run, since they are updated
online.
\\
\\
\subtopic{Equivalence of the Backward and Forward Update
Schemes}{1} We now show that the ``looking forward'' and ``looking
backward'' schemes are equivalent (i.e., give the same results). \\
Denote \\
$\Delta V_{t}^{\gamma}(s_{t}) = \alpha [R_{t}^{\gamma} -
V_{t}(s_{t})]$ for the forward scheme,\\
$\Delta V_{t}^{TD}(s) = \alpha \delta_{t} e_{t}(s)$ for the
backward scheme.\\ We will show that \\
$\sum_{t=0}^{T-1} \Delta V_{t}^{TD}(s) = \sum_{t=0}^{T-1}
\Delta V_{t}^{\gamma}(s_{t}) I(s,s_{t})$ \\ where \\
$I(s,s_{t}) = 1$ if $s = s_{t}$, and $I(s,s_{t}) = 0$ if
$s \neq s_{t}$. \\
We can rewrite $e_{t}(s)$ as $e_{t}(s) = \sum_{k=0}^{t}
(\gamma \lambda)^{t-k} I(s,s_{k})$. \\
For the left side of the equation we have:\\ \\ $
\sum_{t=0}^{T-1} \Delta V_{t}^{TD}(s) = \sum_{t=0}^{T-1}
\alpha \delta_{t} \sum_{k=0}^{t} (\gamma \lambda)^{t-k}
I(s,s_{k})$ \\ \\ $= \sum_{k=0}^{T-1} \alpha \delta_{k}
\sum_{t=0}^{k} (\gamma \lambda)^{k-t} I(s,s_{t})$
\quad (renaming the summation indices) \\ \\ $=
\sum_{t=0}^{T-1} \alpha I(s,s_{t}) \sum_{k=t}^{T-1} (\gamma
\lambda)^{k-t} \delta_{k}$ \quad (exchanging the order of
summation) \\ \\
And for the right side: \\ \\ $\frac{1}{\alpha}
\Delta V_{t}^{\gamma} (s_{t}) = R_{t}^{\gamma} - V_{t}(s_{t})$ \\ $=
-V_{t}(s_{t}) + (1-\gamma)[r_{t} + \lambda V_{t}(s_{t+1})]
+ (1-\gamma)\gamma[r_{t} + \lambda r_{t+1} + \lambda^{2}
V_{t}(s_{t+2})] + \ldots$ \\ \\ $= -V_{t}(s_{t}) +
\sum_{i=0}^{\infty} (\lambda\gamma)^{i} [r_{t+i} + \lambda
(1-\gamma) V_{t}(s_{t+i+1})]$ \\ \\ $=
\sum_{i=0}^{\infty} (\lambda\gamma)^{i} [r_{t+i} + \lambda
V_{t}(s_{t+i+1}) - V_{t}(s_{t+i})]$ \\ \\ $=
\sum_{k=t}^{\infty} (\lambda\gamma)^{k-t} \delta_{k} =
\sum_{k=t}^{T-1} (\lambda\gamma)^{k-t} \delta_{k}$ \\ \\
So we have: \\ $\sum_{t=0}^{T-1} \Delta V_{t}^{\gamma} (s_{t})
I(s,s_{t}) = \sum_{t=0}^{T-1} \alpha I(s,s_{t})
\sum_{k=t}^{T-1} (\lambda\gamma)^{k-t} \delta_{k}$ \\
This is the same expression we obtained for the left side of the
equation, so both schemes give the same total updates at the end
of the run.
\\
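The equality can also be checked numerically. The sketch below computes, for a hypothetical episode with $V$ held fixed (off-line updates, as in the derivation), the total backward updates $\sum_t \alpha\delta_t e_t(s)$ and the total forward updates $\sum_t \alpha I(s,s_t)\sum_{k \geq t}(\lambda\gamma)^{k-t}\delta_k$, and confirms they agree state by state:

```python
# states: s_0 .. s_{T-1} of one run (terminal afterwards,
# V(terminal) = 0); rewards[t] is the reward of step t; V is held
# fixed throughout, i.e. updates are accumulated off-line.
def total_updates(states, rewards, V, alpha, lam, gamma):
    T = len(states)
    # one-step temporal differences delta_t
    delta = [rewards[t]
             + lam * (V[states[t + 1]] if t + 1 < T else 0.0)
             - V[states[t]] for t in range(T)]
    # backward scheme: sum_t alpha * delta_t * e_t(s)
    backward = {s: 0.0 for s in V}
    e = {s: 0.0 for s in V}
    for t in range(T):
        for s in e:
            e[s] *= gamma * lam
        e[states[t]] += 1.0
        for s in V:
            backward[s] += alpha * delta[t] * e[s]
    # forward scheme: alpha * sum_{k>=t} (lam*gamma)^(k-t) * delta_k,
    # credited to the state s_t visited at time t
    forward = {s: 0.0 for s in V}
    for t in range(T):
        forward[states[t]] += alpha * sum(
            (lam * gamma) ** (k - t) * delta[k] for k in range(t, T))
    return backward, forward

V = {"A": 0.3, "B": -0.1, "C": 0.5}  # hypothetical fixed values
back, fwd = total_updates(["A", "B", "A", "C"], [1.0, 0.0, 2.0, 1.0],
                          V, alpha=0.1, lam=0.9, gamma=0.7)
# back and fwd agree for every state, as the derivation shows
```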
\end{document}