Results

Getting Started

To interpret the results properly, we first specify the reward functions that we used. As explained in the “Reward” sub-section of the “Algorithm” section, an immediate reward was used in most of the cases in which substantial results were obtained.

We began with worlds that contained no grass, where the only goal was for the agents to learn to navigate properly (i.e., without bumping into walls).

The immediate reward was highest for moving forward, intermediate for turning right or left, and lowest for not moving (which happens when the agent tries to move forward while facing a wall).

We did not give a higher reward for exploring unvisited areas of the world; thus, under this reward function, a repetitive route in which the agent never bumps into walls is enough to yield a high score.
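
As a concrete illustration of this reward structure, here is a minimal sketch; the action names and numeric values are assumptions chosen only to respect the ordering described above (forward > turn > not moving), not the exact constants used in our runs.

    # Hedged sketch of the navigation-only immediate reward.
    # The action names and numeric values are illustrative assumptions;
    # only the ordering (forward > turn > not moving) reflects the text above.
    def navigation_reward(action, moved):
        # action: 'forward', 'turn_left' or 'turn_right'
        # moved:  False when the agent tried to move forward into a wall
        if not moved:                # bumped into a wall and stayed in place
            return 0.0               # lowest reward
        if action == 'forward':
            return 1.0               # highest reward: the agent advanced
        return 0.5                   # intermediate reward for turning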

A Complex World

At first we were not able to obtain good results: even after many Sarsa iterations, the agents did not learn the most basic principle of proper navigation, namely avoiding bumping into walls. Our assumption was that the world we used was too complicated: it contained many turns and curves, so the agent encountered many different states during its navigation (some of them very rarely). This made the learning process long and tedious. In addition, we gave the agents many eyes and many eye sections, which further increased the number of MDP states.
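
To see why extra eyes and eye sections inflate the state space, consider the following back-of-the-envelope sketch. It assumes each eye section reports one of a small set of discrete readings (say wall, grass or nothing); this is an illustrative simplification of our actual state encoding, but it shows the exponential growth.

    # Rough illustration (not our exact state encoding) of how the number of
    # distinct observations grows with the number of eyes and eye sections,
    # assuming each section reports one of `readings` discrete values.
    def num_observable_states(eyes, sections_per_eye, readings=3):
        return readings ** (eyes * sections_per_eye)

    print(num_observable_states(eyes=3, sections_per_eye=1))   # 27
    print(num_observable_states(eyes=5, sections_per_eye=3))   # 14348907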

 

A Simple World

Thus the first step was to let the agent learn in a simpler world, in which the number of states it encounters is smaller. In addition, we used an agent with only three eyes and one or two eye sections.

In this world, the agent learned to navigate quite quickly. The best agent did not bump into walls at all and moved forward whenever possible.

Labyrinth World

Now we moved to a more challenging world: the Labyrinth world.
Just as in the previous world, the agent learned to navigate properly. The best agent did not bump into walls at all.

An interesting experiment was to place this agent (trained in the Labyrinth world) in the simple world and observe how it navigated there. We found that it displayed the same behavior and skills as the agent that was originally trained in the simple world.

The Labyrinth world, being more complicated than the simple world, contains more MDP states; but since the states are defined by the agent's local perception (its eye readings) rather than by its absolute location, and since we used the same reward function, the resulting policy was the same for the states the agent encountered in both worlds.
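
A small sketch of why the behavior transfers: if the Sarsa Q-table is keyed by the local observation alone, an agent trained in one world reuses the same entries whenever the same observation occurs in another world. The encoding below is a hypothetical illustration, not our exact implementation.

    from collections import defaultdict

    # Q-table keyed by (observation, action); the observation is a tuple of
    # per-eye-section readings, so it carries no information about which world
    # (or which absolute location) the agent is in.
    Q = defaultdict(float)

    def greedy_action(observation, actions=('forward', 'turn_left', 'turn_right')):
        # The same observation looks up the same Q-values in any world, so the
        # greedy choice is identical in the simple and Labyrinth worlds.
        return max(actions, key=lambda a: Q[(observation, a)])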

Adding Grass – The good life

The next step was to add grass lawns to our worlds. The grass areas were expected to be a preferred location for the agent, since the agent receives a higher reward when standing on grass. The agent was modified so that it can spot grass when grass appears in one of its eye sections. The agent was expected to navigate towards the grass once it sees it, and to stay in grassy lawns after entering them. Conversely, when the grass is out of the agent's sight, it will not move in its direction, since it has no other sensor that can guide it to the grass. Thus, if there is a wall between the agent and the grass, the agent will not “see” the grass (its eyes will hit the wall first) and, as a result, will not navigate towards it.
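
The “walls block grass” behavior follows naturally from how an eye reading can be computed: the eye walks outward cell by cell and reports the first non-empty thing it meets. The grid encoding and function below are an illustrative sketch of this kind of ray-cast sensing, not a quote of our actual implementation.

    # Hedged sketch of a single eye reading on a grid world.
    # The WALL/GRASS/EMPTY values and the dict-based grid are illustrative assumptions.
    WALL, GRASS, EMPTY = 'wall', 'grass', 'empty'

    def eye_reading(world, start, direction, max_range):
        # Walk from `start` along `direction` and return the first thing hit.
        # Because the ray stops at the first wall, grass lying behind a wall
        # is never reported: the agent cannot "see" it, as described above.
        x, y = start
        dx, dy = direction
        for _ in range(max_range):
            x, y = x + dx, y + dy
            cell = world.get((x, y), EMPTY)
            if cell in (WALL, GRASS):
                return cell
        return EMPTY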

We added grassy areas to the Labyrinth world. The areas were added in specific locations that forced the agent to turn right or left in order to enter them. This way, the agent could not enter a grassy area just by moving forward; it had to actively prefer turning towards the grass once it was seen by one of its eyes.

Additionally, all grassy areas in the Labyrinth world were surrounded by walls.

 

The best agent learned to navigate properly, as it did in the Labyrinth world without the grass; in addition, once it viewed a grass area, it turned in its direction, entered the grassy area, and continued navigating inside it.

Policy Verification – The Grass Lawn World

In the Labyrinth world all grassy areas were surrounded by walls, so that world did not really test what the agent would do when located inside a grassy area that is not surrounded by walls. Would it exit the area and keep navigating until it found more grass, or would it stay in the grass lawn? To test this we created another world, the “Grass Lawn” world. This world has no walls, but it has a large grass lawn in its middle. It was used to check whether the agent prefers grass over non-grass under no other navigation constraints.

 

As one can see from the navigation route below, the agent trained in the grassy Labyrinth world preferred grass over non-grass; once it entered the grass area, it stayed there and kept navigating within it.

“What’s Behind Door 3” World

An exciting experiment was to see whether we could use our agent for a “practical” assignment: a grass finder. We created a world specifically for this assignment, in which the agent can explore five different aisles, only one of which ends in a grassy area.

The agent can see the grass only when it enters an aisle; thus it cannot know in advance whether it should enter a specific aisle, and it must explore them all in order to find the grassy heaven. Once it does enter the right aisle and finds the grass, it should stay there.

The algorithm yielded agents with exactly these properties. They kept navigating through the different aisles, but once they entered the grassy aisle, they stayed inside it (and would have called their masters if they could).

 

A Moral Dilemma: To turn or not to turn?

Since the agent trained in the “What’s Behind Door 3” world always sees the grass from the same position (while moving forward, with the grass straight ahead of it), it learned how to act in this specific situation, but it did not learn how to act in situations it never encountered.

When training the three-eyed agent, it always sees the grass with its middle eye, the one looking forward.

When we put this agent in the Labyrinth world, it did not behave properly. In the Labyrinth world, a three-eyed agent can view the grass only with its left or right eye, while seeing a wall or nothing with the middle eye. (This was done intentionally, so as to make the agent turn into the grass area rather than just continue walking straight ahead towards it.)

The resulting agent behavior is as follows: as long as it does not see grass, it navigates properly, without bumping into walls and without stopping. Once the agent sees the grass with its left or right eye, it turns towards the grass, then away from it (back to its original direction), and then towards the grass again, indefinitely. It looks as if the agent does not know which option to choose: turning towards the grass or keeping on navigating forward (this is of course an anthropomorphism of the agent, but at this stage we do believe it has a rather humanoid personality). This policy results in endless turns in which the agent does not move forward at all.