Exploration is an essential part of reinforcement learning: its quality limits the quality of the learned policy. Hard-exploration environments are characterized by a huge state space and sparse rewards. Under such conditions, exhaustive exploration of the environment is often impossible, and successfully training an agent requires a very large number of interaction steps. In this paper, we propose an exploration method called Rollback-Explore (RbExplore), which exploits the concept of a persistent Markov decision process, in which the agent can roll back to previously visited states during training. We test our algorithm on the hard-exploration game Prince of Persia, without rewards or domain knowledge. On all game levels used, our agent outperforms or matches the state-of-the-art curiosity methods with knowledge-based intrinsic motivation, ICM and RND. An implementation of RbExplore can be found at https://github.com/cds-mipt/RbExplore.
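The persistent-MDP rollback idea can be illustrated with a minimal sketch. The snippet below assumes a hypothetical gym-style environment exposing save_state() and load_state() methods; it archives visited states and resamples them uniformly before each exploration run, which captures only the core rollback mechanism, not the full RbExplore state-selection strategy.

```python
import random


class RollbackExplorationSketch:
    """Minimal sketch of exploration in a persistent MDP: keep an archive
    of visited states and roll back to one of them instead of resetting."""

    def __init__(self, env, horizon=50):
        self.env = env          # assumed to expose save_state() / load_state()
        self.horizon = horizon
        self.archive = []       # snapshots of visited states

    def run(self, num_iterations=1000):
        self.env.reset()
        self.archive.append(self.env.save_state())
        for _ in range(num_iterations):
            # Roll back to a previously visited state (uniform choice here;
            # RbExplore uses a more informed selection over novel states).
            snapshot = random.choice(self.archive)
            self.env.load_state(snapshot)
            for _ in range(self.horizon):
                action = self.env.action_space.sample()  # exploratory policy stub
                _, _, done, _ = self.env.step(action)
                if done:
                    break
                # Store every reached state for brevity; in practice only
                # sufficiently novel states would be archived.
                self.archive.append(self.env.save_state())
```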