Discussion
The goal of this project was to explore how well Reinforcement Learning algorithms handle the problem of agent navigation in a virtual world. Aside from this main purpose, the project also tested an interesting aspect of Reinforcement Learning algorithms: how does a policy obtained on one learning scene (in our case, a specific world) score on different learning scenes? In other words, how will an agent trained in one world behave in a different world with different properties? Will its policy be robust enough to yield good scores in a new, unfamiliar world as well? Clearly, the answer depends greatly on the kind of problem at hand and on the differences between the worlds. In this project the worlds do not differ greatly: they all contain walls and grassy areas, and usually right angles between the walls.

The results, however, draw a mixed picture. On the one hand, the agent that was trained on the Labyrinth world yielded good results in the Grass Lawn world, the simple world and the "door-3" world, at least on most occasions. On the other hand, the agents that were trained on the simple world and on the "door-3" world obtained quite bad results in the other worlds.
This points to an inherent property of simple, tabular Reinforcement Learning algorithms: the agent's world is divided into discrete states, and after the training period the agent always takes the action dictated by its policy for the specific state it is in. The agent cannot generalise from the subset of states it visited during learning to a larger set, even if the overall structure of the world does not change. The learned policy is valid only for states encountered during the training phase, and the agent will not be able to adjust to new surroundings unless it is trained on those surroundings as well.

Nevertheless, the agents did learn how to navigate in most of the worlds, and were able to yield good scores in some of the new worlds. The policies produced by the Sarsa algorithm appear to be good policies for the navigation problem with the specific reward function we used.

In this project we did not try to direct the agent to navigate towards a specific exit or path. In our implementation this was not very feasible, since the agent had only limited-range eye sensors; it cannot strive for a target that is not directly in front of its eyes. One way to achieve such a goal would be to add a diffuse sensor (like a nose) to the agent. It would be fascinating to see how more complex reward functions, or more complex agents, would behave.
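To make the state-lookup limitation concrete, the sketch below shows a minimal tabular Sarsa agent in Python. It is an illustration only, not the project's actual code: the action set, the hyper-parameter values and the idea of using tuples of discretised eye-sensor readings as states are assumptions made for the example. The point is simply that the Q-table holds informative entries only for states visited during training, so a new world that produces unseen sensor patterns maps to untrained rows.

```python
import random
from collections import defaultdict

# Illustrative action set and hyper-parameters (assumed, not taken from the project).
ACTIONS = ["forward", "turn_left", "turn_right"]
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

# Tabular value function: a state never visited during training keeps its
# default all-zero row, so the greedy policy says nothing useful about it.
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def choose_action(state):
    """Epsilon-greedy action selection over the tabular Q-values of this state."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(Q[state], key=Q[state].get)

def sarsa_update(state, action, reward, next_state, next_action):
    """On-policy Sarsa update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    td_target = reward + GAMMA * Q[next_state][next_action]
    Q[state][action] += ALPHA * (td_target - Q[state][action])

# Example: states are tuples of discretised sensor readings (hypothetical values).
s = (1, 0, 2)                       # a sensor pattern seen during training
a = choose_action(s)
sarsa_update(s, a, reward=-1.0, next_state=(1, 1, 2), next_action="forward")

# A sensor pattern unique to an unfamiliar world was never updated, so the
# "trained" policy is effectively arbitrary there:
print(Q[(9, 9, 9)])                 # {'forward': 0.0, 'turn_left': 0.0, 'turn_right': 0.0}
```

Letting such an agent generalise across worlds would require moving away from a pure lookup table, for example towards a feature-based state encoding or function approximation, which goes beyond the tabular setting discussed here.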