Monte Carlo - optimal control.
Start with some policy pi_0.
Use the current policy pi to generate a trajectory t.
For each (s,a) in the trajectory t, let R'(s,a) be the return of the suffix of t starting at (s,a).
Q(s,a) = average of the R'(s,a) values collected so far.
Let pi be the epsilon-greedy policy with respect to Q(s,a), and repeat from the trajectory-generation step. A sketch of the full loop follows.
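
A minimal sketch of this loop in Python, assuming a Gymnasium-style episodic environment with a discrete action space and hashable states; the every-visit averaging variant is used, and names such as mc_control, num_episodes, epsilon, and gamma are illustrative choices, not part of the notes.

    import random
    from collections import defaultdict

    def mc_control(env, num_episodes=10_000, epsilon=0.1, gamma=1.0):
        Q = defaultdict(float)            # Q(s,a) estimates
        returns_sum = defaultdict(float)  # running sum of the R'(s,a) values
        returns_cnt = defaultdict(int)    # how many values were averaged

        def policy(s):
            # epsilon-greedy with respect to the current Q(s,a)
            if random.random() < epsilon:
                return env.action_space.sample()
            return max(range(env.action_space.n), key=lambda a: Q[(s, a)])

        for _ in range(num_episodes):
            # use the current pi to generate a trajectory t
            trajectory = []
            s, _ = env.reset()
            done = False
            while not done:
                a = policy(s)
                s_next, r, terminated, truncated, _ = env.step(a)
                trajectory.append((s, a, r))
                done = terminated or truncated
                s = s_next

            # walk t backwards; G is the return of the suffix starting
            # at (s,a), i.e. R'(s,a) (every-visit averaging)
            G = 0.0
            for s, a, r in reversed(trajectory):
                G = r + gamma * G
                returns_sum[(s, a)] += G
                returns_cnt[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_cnt[(s, a)]

        return Q

Because policy always reads the current Q, the returns averaged into Q(s,a) come from different epsilon-greedy policies, which is exactly the problem noted next.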
PROBLEM: the values R'(s,a) mix returns generated by different policies, since pi changes between trajectories.
If the algorithm converges, it converges to the optimal policy.
Open Problem: Does it converge?