The Naive Approach

For each $s \in S$: run $\pi$ from $s$ for $m$ independent runs, where the $i$-th run is $T_i$, and let $r_i$ be the reward of $T_i$. Estimate the reward of policy $\pi$ starting from $s$ by:

$\hat{V}^{\pi}(s) = Avg(r_{i}) = \frac{1}{m}\sum_{i=1}^{m}r_{i}$
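A minimal sketch of this procedure in Python, assuming a helper simulate_run(s) (not part of the original notes) that executes a single run of $\pi$ from state $s$ and returns the total reward of that run:

    def naive_policy_evaluation(simulate_run, states, m):
        """Naive Monte Carlo estimate of V^pi(s) for every state s.

        simulate_run(s) is assumed to execute the policy pi once,
        starting from state s, and return the total reward r_i of
        that run.
        """
        v_hat = {}
        for s in states:
            # m independent runs T_1, ..., T_m of pi from s
            rewards = [simulate_run(s) for _ in range(m)]
            # \hat{V}^pi(s) = (1/m) * sum_i r_i
            v_hat[s] = sum(rewards) / m
        return v_hat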

The variables $r_i$ are independent since the runs $T_i$ are independent. Assuming each reward $r_i$ is bounded in $[0,1]$, the Chernoff bound gives:

$Prob[\vert\hat{V}^{\pi}(s)-V^{\pi}(s)\vert\geq \gamma]\leq 2e^{-2m\gamma^{2}}$
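For instance, with $m = 200$ runs and accuracy $\gamma = 0.1$, the bound is $2e^{-2 \cdot 200 \cdot 0.01} = 2e^{-4} \approx 0.037$.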

Taking a union bound over all states, this implies that:

$Prob[\exists s \in S : \vert\hat{V}^{\pi}(s)-V^{\pi}(s)\vert \geq \gamma] \leq \sum_{s \in S} Prob[\vert\hat{V}^{\pi}(s)-V^{\pi}(s)\vert \geq \gamma] \leq \vert S\vert \, 2e^{-2m\gamma^{2}} = \epsilon$.

Setting the right-hand side equal to $\epsilon$ and solving for $\gamma$, the above holds for $\gamma = \sqrt{\frac{\ln(2\vert S\vert/\epsilon)}{2m}}$.
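Equivalently, rearranging the same bound for the number of runs: to estimate $V^{\pi}(s)$ within $\gamma$ simultaneously for all states, with probability at least $1-\epsilon$, it suffices to take

$m \geq \frac{\ln(2\vert S\vert/\epsilon)}{2\gamma^{2}}$.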


