Deep reinforcement learning (DRL) agents are often sensitive to visual changes that were unseen in their training environments. To address this problem, we leverage the sequential nature of RL to learn robust representations that encode only task-relevant information from observations, based on the unsupervised multi-view setting. Specifically, we introduce an auxiliary objective based on the multi-view information bottleneck (MIB) principle, which quantifies the amount of task-irrelevant information and encourages learning representations that are both predictive of the future and less sensitive to task-irrelevant distractions. This enables us to train high-performance policies that are robust to visual distractions and can generalize to unseen environments. We demonstrate that our approach achieves state-of-the-art performance on diverse visual control tasks from the DeepMind Control Suite, even when the background is replaced with natural videos. In addition, we show that our approach outperforms well-established baselines for generalization to unseen environments on the Procgen benchmark. Our code is open-sourced and available at https://github.com/JmfanBU/DRIBO.
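As context for the auxiliary objective mentioned above, a common formalization of the multi-view information bottleneck is sketched below. This follows the general MIB formulation of Federici et al. (2020) and is not necessarily the exact loss used in this work; for two views $v_1, v_2$ of the same underlying state, with encoder $f_\theta$ and representation $z_1 = f_\theta(v_1)$, one maximizes

$$
I(z_1; v_2) \;-\; \beta \, I(z_1; v_1 \mid v_2),
$$

where the first term keeps $z_1$ predictive of the content shared across views (the task-relevant information), and the second term, weighted by $\beta$, penalizes view-specific information retained in $z_1$, which is assumed to be task-irrelevant.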