In continuous control, exploration is often performed through undirected strategies in which the network parameters or the selected actions are perturbed by random noise. Although undirected exploration has been shown to improve the performance of on-policy methods in the deep setting, it introduces excessive computational complexity and is known to fail in the off-policy setting. Intrinsically motivated exploration is an effective alternative to undirected strategies, but it has usually been studied in discrete action domains. In this paper, we investigate how intrinsic motivation can be effectively combined with deep reinforcement learning in the control of continuous systems to obtain directed exploratory behavior. We adapt existing theories on animal motivational systems to the reinforcement learning paradigm and introduce a novel and scalable directed exploration strategy. The introduced approach, driven by the maximization of the value function's error, can extract useful information from a collected set of experiences and unifies the intrinsic exploration motivations in the literature under a single exploration objective. An extensive set of empirical studies demonstrates that our framework scales to larger and more diverse state spaces, dramatically improves the baselines, and significantly outperforms the undirected strategies.
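To make the stated exploration objective concrete, the following is a minimal, self-contained sketch (not the paper's algorithm) of how an intrinsic bonus proportional to the value function's error can be combined with the extrinsic reward. The linear value approximator, feature map, and the coefficient `beta` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4                                # state feature dimension (assumed)
w = np.zeros(dim)                      # weights of a linear value approximator V(s) = w . phi(s)
alpha, gamma, beta = 0.1, 0.99, 0.5    # learning rate, discount, intrinsic scale (assumed)

def phi(state):
    """Feature map for the toy continuous state (identity here)."""
    return np.asarray(state, dtype=float)

def value(state):
    return w @ phi(state)

def step_update(state, reward_ext, next_state, done):
    """One TD(0) update; returns the shaped reward used to drive exploration."""
    global w
    target = reward_ext + (0.0 if done else gamma * value(next_state))
    td_error = target - value(state)
    # Intrinsic motivation: a large value-estimation error yields a large exploration bonus.
    reward_total = reward_ext + beta * abs(td_error)
    w += alpha * td_error * phi(state)  # move V toward the TD target
    return reward_total

# Tiny usage example on random transitions.
s = rng.normal(size=dim)
for _ in range(5):
    s_next = rng.normal(size=dim)
    r = float(rng.normal())
    print(round(step_update(s, r, s_next, done=False), 3))
    s = s_next
```

In an off-policy actor-critic setting, the shaped reward would simply replace the extrinsic reward stored in the replay buffer, so that states where the value estimate is poor become more attractive to revisit.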