Our work focuses on training RL agents on multiple visually diverse environments to improve observational generalization performance. In prior methods, the policy and value networks are optimized separately using a disjoint network architecture to avoid interference and obtain a more accurate value function. We identify that a value network in the multi-environment setting is harder to optimize and more prone to memorizing the training data than in the conventional single-environment setting. In addition, we find that appropriate regularization of the value network is necessary to improve both training and test performance. To this end, we propose Delayed-Critic Policy Gradient (DCPG), a policy gradient algorithm that implicitly penalizes the value estimates by optimizing the value network less frequently but with more training data than the policy network. This can be implemented using a single unified network architecture. Furthermore, we introduce a simple self-supervised task that learns the forward and inverse dynamics of environments using a single discriminator, which can be jointly optimized with the value network. Our proposed algorithms significantly improve observational generalization performance and sample efficiency on the Procgen Benchmark.
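To make the delayed-critic update schedule concrete, below is a minimal PyTorch sketch, not the authors' reference implementation: a single unified actor-critic network whose policy head is updated every iteration, while the value head is updated only every few iterations on all rollouts accumulated since its last update. The hyperparameter names (`value_freq`, `buffer_size`), the toy random rollouts, and the vanilla policy-gradient loss standing in for the PPO objective are assumptions made purely for illustration.

```python
# Minimal sketch of a delayed-critic update schedule (assumptions noted above).
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Single unified network: shared encoder with policy and value heads."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        z = self.encoder(obs)
        return self.policy_head(z), self.value_head(z).squeeze(-1)

obs_dim, n_actions = 8, 4            # toy dimensions, not from the paper
net = ActorCritic(obs_dim, n_actions)
policy_opt = torch.optim.Adam(net.parameters(), lr=5e-4)
value_opt = torch.optim.Adam(net.parameters(), lr=5e-4)

value_freq, buffer_size = 8, 8       # hypothetical: critic delay and rollout buffer size
rollout_buffer = []                  # stores past rollouts for the delayed value phase

for iteration in range(32):
    # Collect a fresh rollout (random placeholders stand in for environment data).
    obs = torch.randn(256, obs_dim)
    actions = torch.randint(0, n_actions, (256,))
    advantages = torch.randn(256)
    returns = torch.randn(256)
    rollout_buffer.append((obs, returns))
    rollout_buffer = rollout_buffer[-buffer_size:]

    # Policy phase: update the policy head every iteration on the newest rollout.
    logits, _ = net(obs)
    logp = torch.log_softmax(logits, dim=-1)[torch.arange(len(actions)), actions]
    policy_loss = -(logp * advantages).mean()   # vanilla policy gradient as a PPO stand-in
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Delayed value phase: update the critic only every `value_freq` iterations,
    # using all rollouts accumulated since its last update (more data, less often).
    if (iteration + 1) % value_freq == 0:
        all_obs = torch.cat([o for o, _ in rollout_buffer])
        all_ret = torch.cat([r for _, r in rollout_buffer])
        _, values = net(all_obs)
        value_loss = 0.5 * (values - all_ret).pow(2).mean()
        value_opt.zero_grad()
        value_loss.backward()
        value_opt.step()
```

Because both heads share one encoder, updating the critic rarely but on a larger batch of (stale) rollouts acts as the implicit regularization on value estimates described above, without requiring the disjoint policy/value architectures used in prior methods.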