Algorithms

 Two Reinforcement Learning algorithms were used in this project:

The Sarsa algorithm was used to find the agent's best navigation policy, and the TD0 algorithm was used to evaluate Sarsa's results.

Algorithm Parameters

·       We chose to treat the navigation problem as a finite horizon problem and gave the agent a fixed life span (just like in real life). Therefore, the LAMBDA parameter was set to 1.

·     The ALPHA parameter for the algorithms was adjusted using the following method: for each state, the number of updates performed on that state was stored, and upon the next update the ALPHA parameter was set to 1/(number of updates for this state). This way each state's learning rate decays according to how many times that particular state was updated, regardless of how often it was visited compared to other states. (If ALPHA were 1/number of iterations, the less frequently visited states might not converge properly.)
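
As a rough sketch of this schedule (the names update_counts and alpha_for are illustrative, not taken from the project code), the per-state learning rate can be maintained as follows:

    # Minimal sketch of the per-state ALPHA schedule described above.
    update_counts = {}          # state -> number of updates performed so far

    def alpha_for(state):
        """Return the learning rate to use for the next update of this state."""
        update_counts[state] = update_counts.get(state, 0) + 1
        return 1.0 / update_counts[state]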

 MDP-states

Each MDP-state reflects what the agent sees at the given moment according to its position and orientation. The state number is generated by a unique coding reflecting the input view of each of the agent's eyes.

Each eye input is a combination of two parameters:

  • The object that the sight line encountered (wall, grass or nothing).
  • The distance (measured only by the section in which it was seen).

 The input value for each eye is therefore:

  • 0 to NofSections-1: a wall was encountered in section 1 to NofSections
  • NofSections to 2*NofSections-1: grass was encountered in section 1 to NofSections
  • 2*NofSections: there is nothing within the view distance of this eye.

Therefore, the number of MDP states is determined by the number of the agent's eyes and the number of sections in each eye.

Hence, the number of MDP states was calculated to be:

pow(2 * NofSections+1, NofEyes)
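
As a rough sketch of this coding (the names encode_state and eye_inputs are illustrative assumptions, not the project's actual identifiers), each eye's input can be treated as a digit in base 2*NofSections+1, which yields exactly pow(2*NofSections+1, NofEyes) distinct state numbers:

    # Sketch: encode the per-eye input values into a single MDP-state number.
    # eye_inputs is a list of length NofEyes, each entry in [0, 2*NofSections].
    def encode_state(eye_inputs, NofSections):
        base = 2 * NofSections + 1          # number of possible values per eye
        state = 0
        for value in eye_inputs:
            state = state * base + value    # treat each eye as a digit in this base
        return state                        # in [0, base**NofEyes - 1]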

Reward

Since the life span of the agent is chosen by the user, and might be long, we decided to give the agent an immediate return after each step. The immediate return value is calculated based on two parameters:

       1.         The action the agent performs: we consider the navigation a success if the agent actually moves forward; turning right or left is valued less; and the worst action is to stand still (which happens when the agent tries to move forward while facing a wall).

       2.         The kind of landscape the agent is in – the agent is rewarded more for standing on a grassy landscape than on a non-grassy one.
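
A minimal sketch of such an immediate reward function is given below; the numeric values and the parameter names (moved_forward, turned, on_grass) are illustrative assumptions, not the constants used in the project:

    # Sketch of the immediate reward described above; the numbers are
    # illustrative only -- the project's actual reward constants are not given here.
    def immediate_reward(moved_forward, turned, on_grass):
        if moved_forward:
            reward = 1.0        # best outcome: the agent actually advanced
        elif turned:
            reward = 0.2        # turning left/right is valued less
        else:
            reward = -1.0       # standing still (e.g., facing a wall) is worst
        if on_grass:
            reward += 0.5       # bonus for being on a grassy landscape
        return reward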


Note that the chosen reward determines the agent's behavior in the world. Since no reward was given for exploring positions the agent has never visited, there is nothing under the current reward mechanism to keep the agent from choosing as its best policy one that makes it repeat the same route again and again. The current reward function was chosen since our main goal was to prevent the agent from walking into walls and from halting, and in addition to encourage it to stay in grass areas.

The decision to give only an immediate reward was partly made because through trial and error we discovered that the agent does not learn well enough when given only a final reward. This probably has to do with the fact that for a long life span the final reward does not give enough information and does not diffuse back to all the states (unlike the blackjack exercise, in which a final return value was given, where each game was very short and there were considerably fewer states).

The Sarsa Algorithm

This is an online algorithm for finding the agent's best policy (meaning which action to take in each MDP-state).

The algorithm performs a predefined number of iterations.

Each iteration includes the following steps:

  • The agent chooses the epsilon-greedy action for its state, according to the matrix of state-action values, which is updated by the algorithm.
  • It then gets its new state according to the action taken.
  • Then it updates the value matrix in the following way: a second action is chosen in an epsilon-greedy fashion from the new state, and the immediate return to which this action leads is calculated. The value of the previous state and action is updated according to the immediate reward and the value of the new state and action.
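
A minimal sketch of this update, assuming LAMBDA = 1, a dictionary Q of state-action values, and hypothetical helpers epsilon_greedy, step and alpha_for (none of these names are taken from the project code):

    # Sketch of one Sarsa iteration (LAMBDA = 1, i.e. no discounting).
    # Q maps (state, action) pairs to values; epsilon_greedy chooses an action,
    # step moves the agent in the world and returns the new state and reward.
    def sarsa_step(Q, state, action, alpha_for, epsilon_greedy, step):
        next_state, reward = step(state, action)        # act in the world
        next_action = epsilon_greedy(Q, next_state)     # second epsilon-greedy choice
        alpha = alpha_for(state)                        # per-state learning rate
        old = Q.get((state, action), 0.0)
        target = reward + Q.get((next_state, next_action), 0.0)
        Q[(state, action)] = old + alpha * (target - old)
        return next_state, next_action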

We proved in class that under certain assumptions, the algorithm converges.

The TD0 Algorithm

This is an online algorithm which evaluates a given policy. The evaluation yields a value for each MDP-state. The score we used was the average value over all states.

The algorithm performs a predefined number of iterations. It updates the value function of a state by calculating the temporal difference between the value obtained from moving a step in the world (sampling the world) and the previous state's value.

Each iteration includes the following steps:

  • The agent moves a step in the world, according to the action chosen by its policy for its current state.
  • The value of the next state and the immediate return value are calculated, and the value of the agent's previous state is updated according to them.
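
A minimal sketch of this update and of the averaged score, assuming LAMBDA = 1, a dictionary V of state values, and hypothetical helpers policy, step and alpha_for (again, the names are illustrative):

    # Sketch of one TD0 iteration (LAMBDA = 1) plus the averaged score.
    # V maps states to values; policy picks the evaluated policy's action,
    # step moves the agent in the world and returns the new state and reward.
    def td0_step(V, state, policy, step, alpha_for):
        action = policy(state)                      # follow the evaluated policy
        next_state, reward = step(state, action)    # sample the world
        alpha = alpha_for(state)
        old = V.get(state, 0.0)
        target = reward + V.get(next_state, 0.0)    # one-step return, no discounting
        V[state] = old + alpha * (target - old)     # temporal-difference update
        return next_state

    def policy_score(V):
        """Average value over the states stored in V, used as the evaluation score."""
        return sum(V.values()) / len(V) if V else 0.0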

Algorithm Implementation Notes

·          In each of the algorithms, the agent's initial position is chosen randomly (from all the positions that are not a wall). This is done in order to give the agent a chance to reach new states (to “see” more view options).
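
A minimal sketch of such a random start, assuming a grid representation in which is_wall identifies wall cells (the names grid, is_wall and random_start are illustrative, not the project's own):

    # Sketch: pick a random starting position among the non-wall cells.
    import random

    def random_start(grid, is_wall):
        free_cells = [(r, c) for r, row in enumerate(grid)
                             for c, cell in enumerate(row)
                             if not is_wall(cell)]
        return random.choice(free_cells)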

When choosing a more complicated agent (an agent which has many eyes and/or many sections in each eye), the number of the agent's MDP states grows. Therefore, it takes more iterations for the algorithm to converge. In general the algorithms yielded better results when the agent had a moderate number of states.
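
For illustration (the parameter values here are hypothetical, not the ones used in the project): an agent with 3 eyes and 2 sections per eye has pow(2*2+1, 3) = 125 states, while an agent with 5 eyes and 4 sections per eye already has pow(2*4+1, 5) = 59049 states.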