Exploration is crucial for training an optimal reinforcement learning (RL) policy, and the key is to discriminate whether a visited state is novel. Most previous work focuses on designing heuristic rules or distance metrics to check whether a state is novel, without considering that such a discrimination process can itself be learned. In this paper, we propose a novel method called generative adversarial exploration (GAEX) to encourage exploration in RL by introducing an intrinsic reward output from a generative adversarial network, in which the generator provides fake samples of states that help the discriminator identify less frequently visited states. The agent is thus encouraged to visit states that the discriminator is less confident to judge as visited. GAEX is easy to implement and highly training-efficient. In our experiments, we apply GAEX to DQN, and the resulting DQN-GAEX algorithm achieves convincing performance on challenging exploration problems, including the games Venture, Montezuma's Revenge, and Super Mario Bros, without further fine-tuning of complicated learning algorithms. To our knowledge, this is the first work to employ a GAN in RL exploration problems.
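For concreteness, the sketch below illustrates one way such a GAN-based intrinsic bonus could be computed: a discriminator is trained to label recently visited states as "real" against generator fakes, and states it scores with low confidence receive a larger exploration bonus. The reward form `-log D(s)`, the network sizes, and the update schedule are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal GAEX-style sketch (illustrative assumptions throughout):
# intrinsic reward is high for states the discriminator is unsure are "visited".
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Judges whether a state looks like one the agent has already visited."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )
    def forward(self, s):
        return self.net(s)

class Generator(nn.Module):
    """Produces fake states that mimic the visited-state distribution."""
    def __init__(self, noise_dim, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )
    def forward(self, z):
        return self.net(z)

def intrinsic_reward(disc, state):
    # Assumed reward form: the lower D(s) (less confident the state was visited),
    # the larger the exploration bonus.
    with torch.no_grad():
        d = disc(state)
    return -torch.log(d + 1e-8)

def gan_update(disc, gen, d_opt, g_opt, visited_states, noise_dim=32):
    """One adversarial step on a batch of recently visited states."""
    bce = nn.BCELoss()
    batch = visited_states.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # Discriminator: visited states are "real", generator samples are "fake".
    z = torch.randn(batch, noise_dim)
    fake_states = gen(z).detach()
    d_loss = bce(disc(visited_states), real_labels) + bce(disc(fake_states), fake_labels)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: try to fool the discriminator with fake states.
    z = torch.randn(batch, noise_dim)
    g_loss = bce(disc(gen(z)), real_labels)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

In a DQN-style agent, `intrinsic_reward` would simply be added to the environment reward at each step, while `gan_update` runs periodically on states drawn from the replay buffer; both choices here are assumptions for illustration.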