Monte Carlo - optimal control.
Start with some policy pi_0.
Use the current policy pi to generate a trajectory t.
For each (s,a) in the trajectory t, let R'(s,a) be the return of the suffix of t starting at (s,a).
Q(s,a) = average of the R'(s,a) values collected so far.
Let pi be the epsilon-greedy policy with respect to Q(s,a), and repeat from the trajectory-generation step. A sketch of the full loop follows.
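
A minimal sketch of this loop in Python, assuming a Gymnasium-style episodic environment with a discrete action space and hashable states; the every-visit averaging variant is used, and names such as mc_control, num_episodes, epsilon, and gamma are illustrative choices, not part of the notes.

    import random
    from collections import defaultdict

    def mc_control(env, num_episodes=10_000, epsilon=0.1, gamma=1.0):
        Q = defaultdict(float)            # Q(s,a) estimates
        returns_sum = defaultdict(float)  # running sum of the R'(s,a) values
        returns_cnt = defaultdict(int)    # how many values were averaged

        def policy(s):
            # epsilon-greedy with respect to the current Q(s,a)
            if random.random() < epsilon:
                return env.action_space.sample()
            return max(range(env.action_space.n), key=lambda a: Q[(s, a)])

        for _ in range(num_episodes):
            # use the current pi to generate a trajectory t
            trajectory = []
            s, _ = env.reset()
            done = False
            while not done:
                a = policy(s)
                s_next, r, terminated, truncated, _ = env.step(a)
                trajectory.append((s, a, r))
                done = terminated or truncated
                s = s_next

            # walk t backwards; G is the return of the suffix starting
            # at (s,a), i.e. R'(s,a) (every-visit averaging)
            G = 0.0
            for s, a, r in reversed(trajectory):
                G = r + gamma * G
                returns_sum[(s, a)] += G
                returns_cnt[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_cnt[(s, a)]

        return Q

Because policy always reads the current Q, the returns averaged into Q(s,a) come from different epsilon-greedy policies, which is exactly the problem noted next.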
PROBLEM: the values R'(s,a) mix returns generated by different policies, since pi changes between trajectories.
If the algorithm converges, it converges to the optimal policy.
Open Problem: Does it converge?