Standard deep reinforcement learning algorithms use a shared representation for the policy and value function, especially when training directly from images. However, we argue that more information is needed to accurately estimate the value function than to learn the optimal policy. Consequently, the use of a shared representation for the policy and value function can lead to overfitting. To alleviate this problem, we propose two approaches which are combined to create IDAAC: Invariant Decoupled Advantage Actor-Critic. First, IDAAC decouples the optimization of the policy and value function, using separate networks to model them. Second, it introduces an auxiliary loss which encourages the representation to be invariant to task-irrelevant properties of the environment. IDAAC shows good generalization to unseen environments, achieving a new state-of-the-art on the Procgen benchmark and outperforming popular methods on DeepMind Control tasks with distractors. Our implementation is available at https://github.com/rraileanu/idaac.
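The sketch below is a minimal illustration of the decoupling idea only, not the authors' reference implementation: the policy and value function are modeled by two separate networks, so gradients from value estimation never shape the policy's representation. The flattened observations, fully connected layers, and layer sizes are illustrative assumptions (the paper trains convolutional encoders from images), and the adversarial invariance loss is omitted.

```python
# Minimal sketch of decoupled policy and value networks (assumptions noted above).
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Policy network with its own encoder; also predicts advantages,
    following the decoupled advantage actor-critic idea."""
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.pi_head = nn.Linear(hidden, num_actions)   # action logits
        self.adv_head = nn.Linear(hidden, num_actions)  # advantage estimates

    def forward(self, obs):
        z = self.encoder(obs)
        return self.pi_head(z), self.adv_head(z)


class ValueNet(nn.Module):
    """Separate value network: its gradients do not touch the policy encoder."""
    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs):
        return self.net(obs)


# The two networks would be optimized independently, e.g.:
# policy, value = PolicyNet(obs_dim=64, num_actions=15), ValueNet(obs_dim=64)
# pi_optim = torch.optim.Adam(policy.parameters(), lr=5e-4)
# v_optim = torch.optim.Adam(value.parameters(), lr=5e-4)
```

Keeping the optimizers separate is what prevents the value loss, which typically needs more environment-specific information, from leaking task-irrelevant features into the policy's representation.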