In most real-world scenarios, a policy trained by reinforcement learning in one environment needs to be deployed in another, potentially quite different environment. However, generalization across different environments is known to be hard. A natural solution would be to keep training after deployment in the new environment, but this cannot be done if the new environment offers no reward signal. Our work explores the use of self-supervision to allow the policy to continue training after deployment without using any rewards. While previous methods explicitly anticipate changes in the new environment, we assume no prior knowledge of those changes yet still obtain significant improvements. Empirical evaluations are performed on diverse simulation environments from the DeepMind Control Suite and ViZDoom, as well as real robotic manipulation tasks in continuously changing environments, with observations taken from an uncalibrated camera. Our method improves generalization in 31 out of 36 environments across various tasks and outperforms domain randomization on a majority of environments.
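The core idea of reward-free adaptation can be illustrated with a minimal sketch: a shared encoder feeds both a policy head and a self-supervised head, and at deployment only the self-supervised objective is used to keep updating the encoder. Everything below is hypothetical and simplified; the model, the reconstruction objective, and all dimensions are illustrative assumptions, not the paper's actual architecture or auxiliary task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny setup: a shared linear "encoder" would normally feed both
# the policy head and a self-supervised head. Here a simple reconstruction
# objective stands in for the paper's reward-free auxiliary task.
obs_dim, feat_dim = 8, 4
W_enc = rng.normal(scale=0.1, size=(feat_dim, obs_dim))   # shared encoder
W_dec = rng.normal(scale=0.1, size=(obs_dim, feat_dim))   # self-supervised head

def ss_loss(obs, W_enc, W_dec):
    # Self-supervised loss: needs no reward signal, only the observation.
    z = W_enc @ obs
    return 0.5 * np.sum((W_dec @ z - obs) ** 2)

lr = 1e-2
obs = rng.normal(size=obs_dim)   # observation from the shifted environment

losses = []
for _ in range(50):
    # Manual gradients for the linear model.
    z = W_enc @ obs
    err = W_dec @ z - obs
    grad_dec = np.outer(err, z)
    grad_enc = np.outer(W_dec.T @ err, obs)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc   # encoder keeps adapting at test time, reward-free
    losses.append(ss_loss(obs, W_enc, W_dec))
```

Because the policy shares the encoder, any improvement the auxiliary loss induces in the representation is inherited by the policy without ever touching a reward.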