Algorithms

Two Reinforcement Learning algorithms were used in this project: the Sarsa algorithm was used to find the agent's best navigation policy, and the TD0 algorithm was used to evaluate Sarsa's results.

Algorithm Parameters

· We chose to treat the navigation problem as a finite-horizon problem and gave the agent a fixed life span (just like in real life). Therefore, the LAMBDA parameter was set to 1.

· The ALPHA parameter for the algorithms was adjusted using the following method: for each state, the number of updates performed on that state was stored, and on the next update ALPHA was set to 1/(number of updates for this state). This way each state's update rate is proportional to the number of times it was updated, regardless of how often it was visited compared to other states. (If ALPHA were 1/(number of iterations), the less-visited states might not converge properly.) A short sketch of this step-size rule appears after this list.
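
The following is a minimal sketch of that per-state step-size rule; the names `update_counts` and `step_size` are illustrative and not taken from the project code.

```python
from collections import defaultdict

# Number of updates performed so far for each MDP-state (illustrative structure).
update_counts = defaultdict(int)

def step_size(state):
    """Per-state learning rate: ALPHA = 1 / (number of updates of this state)."""
    update_counts[state] += 1          # this call corresponds to one more update of `state`
    return 1.0 / update_counts[state]  # 1 on the first update, 1/2 on the second, ...
```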
MDP-states

Each MDP-state reflects what the agent sees at the given moment according to its position and orientation. The state number is generated by a unique coding reflecting the input view of each of the agent's eyes. Each eye's input is a combination of two parameters, and the resulting input value for each eye can take 2 * NofSections + 1 distinct values. Therefore, the number of MDP states is determined by the number of the agent's eyes and the number of sections in each eye. Hence, the number of MDP states is pow(2 * NofSections + 1, NofEyes).
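
Below is a minimal sketch of how such a state number could be generated, assuming each eye's input is an integer in the range 0 to 2 * NofSections; the function name `encode_state` and its arguments are illustrative, not the project's actual code.

```python
def encode_state(eye_inputs, n_of_sections):
    """Combine the per-eye input values into a single MDP-state number.

    Each eye input is assumed to be an integer in [0, 2 * n_of_sections],
    so treating the inputs as digits in base (2 * n_of_sections + 1) yields
    a unique code, and the number of possible states is
    (2 * n_of_sections + 1) ** number_of_eyes.
    """
    base = 2 * n_of_sections + 1
    state = 0
    for value in eye_inputs:   # one digit per eye
        state = state * base + value
    return state
```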
Reward

Since the life span of the agent is chosen by the user, and might be long, we decided to give the agent an immediate return after each step. The immediate return value is calculated based on two parameters:

1. The action the agent performs: we consider it a success in navigation if the agent actually moves forward, turning right or left is valued less, and the worst action is to stand still (which happens when the agent tries to move forward while facing a wall).

2. The kind of landscape the agent is in: the agent is rewarded more for standing on a grassy landscape than on a non-grassy landscape.

The decision to give only an immediate reward was made partly because, through trial and error, we discovered that the agent does not learn well enough when given only a final reward. This probably has to do with the fact that for a long life span the final reward does not carry enough information and does not diffuse back to all the states (unlike the blackjack exercise, in which a final return value was given, where each game was very short and there were considerably fewer states).
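
A minimal sketch of an immediate-reward function with the two components described above follows; the specific numeric values and the names `ACTION_REWARD` and `GRASS_BONUS` are illustrative assumptions, not the values used in the project.

```python
# Illustrative reward values -- the project's actual numbers are not reproduced here.
ACTION_REWARD = {
    "forward": 1.0,       # best: the agent actually moved forward
    "turn_left": 0.3,     # turning is worth less than moving forward
    "turn_right": 0.3,
    "stand_still": -1.0,  # worst: e.g. trying to walk into a wall
}
GRASS_BONUS = 0.5         # standing on grass is rewarded more than non-grass

def immediate_reward(action, on_grass):
    """Immediate return after one step: action component plus landscape component."""
    return ACTION_REWARD[action] + (GRASS_BONUS if on_grass else 0.0)
```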
The Sarsa Algorithm

This is an online algorithm for finding the agent's best policy (i.e., which action to take in each MDP-state). The algorithm performs a predefined number of iterations. Each iteration includes the following steps:
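
The sketch below shows one textbook Sarsa iteration under the conventions above (LAMBDA as the discount factor and the per-state ALPHA); the environment interface (`env.step`, `choose_action`) is assumed, not taken from the project.

```python
def sarsa_iteration(Q, state, action, env, choose_action, step_size, LAMBDA=1.0):
    """One textbook Sarsa step: act, observe, pick the next action, update Q(state, action)."""
    reward, next_state = env.step(action)        # sample the world (assumed interface)
    next_action = choose_action(Q, next_state)   # e.g. epsilon-greedy with respect to Q
    alpha = step_size(state)                     # per-state ALPHA = 1 / number of updates
    td_target = reward + LAMBDA * Q[next_state][next_action]
    Q[state][action] += alpha * (td_target - Q[state][action])
    return next_state, next_action
```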
We proved in class that, under certain assumptions, the algorithm converges.
The TD0 Algorithm

This is an online algorithm that evaluates a given policy. The evaluation yields a value for each MDP-state; the score we used was an average over all states. The algorithm performs a predefined number of iterations. It updates the value function of a state by calculating the temporal difference between the value obtained from taking a step in the world (sampling the world) and the previous state value. Each iteration includes the following steps:
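
The sketch below shows one textbook TD(0) step under the same conventions; `policy`, `env.step`, and `step_size` are assumed names rather than the project's actual code.

```python
def td0_iteration(V, state, policy, env, step_size, LAMBDA=1.0):
    """One textbook TD(0) step: follow the evaluated policy and update V(state)."""
    action = policy(state)                  # the fixed policy being evaluated (e.g. Sarsa's result)
    reward, next_state = env.step(action)   # sample the world (assumed interface)
    alpha = step_size(state)                # per-state ALPHA = 1 / number of updates
    td_error = reward + LAMBDA * V[next_state] - V[state]  # the temporal difference
    V[state] += alpha * td_error
    return next_state
```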
Algorithm Implementation Notes

· In each of the algorithms, the agent's initial position is chosen randomly (from among all positions that are not a wall). This is done in order to give the agent a chance to reach new states (to “see” more view options).