Discussion
The goal of this project was to explore how well Reinforcement Learning algorithms handle the problem of agent navigation in a virtual world. Aside from this main purpose, the project also tested an interesting aspect of Reinforcement Learning algorithms: how does a policy obtained on one learning scene (in our case, a specific world) score on different learning scenes? In other words, how will an agent trained in one world behave in a different world with different properties? Will its policy be robust enough to yield good scores in a new, unfamiliar world as well? Clearly, the answer depends greatly on the kind of problem at hand and on the differences between the worlds. In this project the worlds do not differ greatly: they all contain walls and grassy areas, and usually right angles between the walls.

The results, however, draw a mixed picture. On the one hand, the agent that was trained on the Labyrinth world yielded good results in the Grass Lawn world, the simple world and the "door-3" world, at least on most occasions. On the other hand, the agents that were trained on the simple world and on the "door-3" world obtained quite bad results in the other worlds.
This points to an inherent property of simple, tabular Reinforcement Learning algorithms: the agent's world is divided into discrete states, and after the training period the agent always takes the action dictated by its policy for the specific state it is in. The agent cannot generalise from the subset of states it visited during learning to a larger set, even if the overall structure of the world does not change. The learned policy is valid only for states encountered during the training phase, and the agent will not be able to adjust to new surroundings unless it is trained on those surroundings as well.

Nevertheless, the agents did learn how to navigate in most of the worlds, and were able to yield good scores in some of the new worlds. The policies produced by the Sarsa algorithm appear to be good policies for the navigation problem with the specific reward function we used.

In this project we did not try to direct the agent to navigate towards a specific exit or path. In our implementation this was not very feasible, since the agent had only limited-range eye sensors; it cannot strive for a target that is not directly in front of its eyes. One way to achieve such a goal would be to add a diffuse sensor (like a nose) to the agent. It would be fascinating to see how more complex reward functions, or more complex agents, would behave.
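To make the state-lookup limitation concrete, the sketch below shows a minimal tabular Sarsa agent in Python. It is an illustration only, not the project's actual code: the action set, the hyper-parameter values and the idea of using tuples of discretised eye-sensor readings as states are assumptions made for the example. The point is simply that the Q-table holds informative entries only for states visited during training, so a new world that produces unseen sensor patterns maps to untrained rows.

```python
import random
from collections import defaultdict

# Illustrative action set and hyper-parameters (assumed, not taken from the project).
ACTIONS = ["forward", "turn_left", "turn_right"]
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

# Tabular value function: a state never visited during training keeps its
# default all-zero row, so the greedy policy says nothing useful about it.
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def choose_action(state):
    """Epsilon-greedy action selection over the tabular Q-values of this state."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(Q[state], key=Q[state].get)

def sarsa_update(state, action, reward, next_state, next_action):
    """On-policy Sarsa update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    td_target = reward + GAMMA * Q[next_state][next_action]
    Q[state][action] += ALPHA * (td_target - Q[state][action])

# Example: states are tuples of discretised sensor readings (hypothetical values).
s = (1, 0, 2)                       # a sensor pattern seen during training
a = choose_action(s)
sarsa_update(s, a, reward=-1.0, next_state=(1, 1, 2), next_action="forward")

# A sensor pattern unique to an unfamiliar world was never updated, so the
# "trained" policy is effectively arbitrary there:
print(Q[(9, 9, 9)])                 # {'forward': 0.0, 'turn_left': 0.0, 'turn_right': 0.0}
```

Letting such an agent generalise across worlds would require moving away from a pure lookup table, for example towards a feature-based state encoding or function approximation, which goes beyond the tabular setting discussed here.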