We focus on the problem of training RL agents on multiple training environments to improve observational generalization performance. In prior methods, the policy and value networks are optimized separately using a disjoint network architecture to avoid interference and to obtain a more accurate value function. We identify that the value network in the multiple-environment setting is more challenging to optimize and more prone to overfitting the training data than in the conventional single-environment setting. In addition, we find that appropriate regularization of the value network is required for better training and test performance. To this end, we propose Delayed-Critic Policy Gradient (DCPG), which implicitly penalizes the value estimates by optimizing the value network less frequently but with more training data than the policy network; this can be implemented using a shared network architecture. Furthermore, we introduce a simple self-supervised task that learns the forward and inverse dynamics of environments using a single discriminator, which can be jointly optimized with the value network. Our proposed algorithms significantly improve observational generalization performance and sample efficiency on the Procgen Benchmark.
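To make the delayed value update concrete, below is a minimal, self-contained sketch of the schedule described above, not the authors' implementation: a shared encoder with policy and value heads, where the policy head is optimized on every fresh rollout and the value head is optimized only every few rollouts on the data accumulated since its last update. All class names, the update frequency `value_delay`, and the synthetic rollout tensors are illustrative assumptions.

```python
"""Hypothetical sketch of a delayed value-update schedule with a shared
actor-critic network; all names and hyperparameters are assumptions."""
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """A single shared encoder with separate policy and value heads."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        z = self.encoder(obs)
        return self.policy_head(z), self.value_head(z).squeeze(-1)

obs_dim, n_actions, value_delay = 8, 4, 8   # assumed toy dimensions
net = SharedActorCritic(obs_dim, n_actions)
opt = torch.optim.Adam(net.parameters(), lr=3e-4)
stale_obs, stale_returns = [], []           # data kept for the delayed value phase

for iteration in range(32):
    # Placeholder rollout data; a real agent would collect this from the
    # training environments with its current policy.
    obs = torch.randn(256, obs_dim)
    actions = torch.randint(0, n_actions, (256,))
    returns = torch.randn(256)
    advantages = torch.randn(256)

    # Policy phase: update the policy head on the fresh rollout every iteration
    # (a plain policy-gradient loss stands in for the full PPO objective).
    logits, _ = net(obs)
    logp = torch.distributions.Categorical(logits=logits).log_prob(actions)
    policy_loss = -(logp * advantages).mean()
    opt.zero_grad()
    policy_loss.backward()
    opt.step()

    # Accumulate rollouts for the less frequent value phase.
    stale_obs.append(obs)
    stale_returns.append(returns)

    # Value phase: update the value head only every `value_delay` iterations,
    # regressing on all data gathered since the last value update.
    if (iteration + 1) % value_delay == 0:
        all_obs, all_ret = torch.cat(stale_obs), torch.cat(stale_returns)
        _, values = net(all_obs)
        value_loss = 0.5 * (values - all_ret).pow(2).mean()
        opt.zero_grad()
        value_loss.backward()
        opt.step()
        stale_obs.clear()
        stale_returns.clear()
```

Updating the value head less often, but on a larger and partly stale batch, is what the abstract refers to as an implicit penalty on the value estimates; because both heads share one encoder, no disjoint architecture is needed.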