Deep reinforcement learning (DRL) agents are often sensitive to visual changes that were not present in their training environments. To address this problem, we leverage the sequential nature of RL to learn robust representations that encode only task-relevant information from observations in the unsupervised multi-view setting. Specifically, we introduce a novel contrastive version of the Multi-View Information Bottleneck (MIB) objective for temporal data. We train RL agents from pixels with this auxiliary objective to learn robust representations that compress away task-irrelevant information while remaining predictive of task-relevant dynamics. This approach enables us to train high-performance policies that are robust to visual distractions and generalize well to unseen environments. We demonstrate that our approach achieves state-of-the-art performance on a diverse set of visual control tasks in the DeepMind Control Suite when the background is replaced with natural videos. In addition, we show that our approach outperforms well-established baselines for generalization to unseen environments on the Procgen benchmark. Our code is open-sourced and available at https://github.com/BU-DEPEND-Lab/DRIBO.
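To make the auxiliary objective concrete, below is a minimal sketch of a contrastive multi-view information-bottleneck loss of the kind the abstract describes. It is not the released DRIBO implementation: it assumes PyTorch, diagonal-Gaussian posteriors over the representation for each view, and uses InfoNCE as the contrastive lower bound on the mutual information between views; the function name and all arguments are illustrative.

```python
# A hedged sketch, not the authors' code: contrastive MIB-style auxiliary loss.
# Assumes two views (e.g., two augmentations of the same observation sequence)
# have been encoded into samples z1, z2 and Gaussian posterior parameters.
import torch
import torch.nn.functional as F


def contrastive_mib_loss(z1, z2, mu1, logvar1, mu2, logvar2,
                         beta=1e-3, temperature=0.1):
    """z1, z2: (B, D) sampled representations of two views of the same data.
    mu*/logvar*: diagonal-Gaussian posterior parameters p(z|view) per view.
    beta: weight on the bottleneck (compression) term, a hypothetical default.
    """
    # InfoNCE lower bound on I(z1; z2): z1[i] should match z2[i] against
    # all other samples in the batch (negatives).
    z1n = F.normalize(z1, dim=1)
    z2n = F.normalize(z2, dim=1)
    logits = z1n @ z2n.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    info_nce = F.cross_entropy(logits, labels)

    # Symmetrized KL between the two view-conditional posteriors penalizes
    # view-specific (task-irrelevant) information kept by either encoder.
    var1, var2 = logvar1.exp(), logvar2.exp()
    kl_12 = 0.5 * (logvar2 - logvar1
                   + (var1 + (mu1 - mu2) ** 2) / var2 - 1).sum(dim=1)
    kl_21 = 0.5 * (logvar1 - logvar2
                   + (var2 + (mu2 - mu1) ** 2) / var1 - 1).sum(dim=1)
    skl = 0.5 * (kl_12 + kl_21).mean()

    # Maximize agreement between views, compress what they don't share.
    return info_nce + beta * skl
```

In a training loop, this loss would be added to the RL objective and backpropagated through the pixel encoders, so the learned representation retains only the information shared across views, which, under the multi-view assumption, is the task-relevant content.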