Vision-based reinforcement learning (RL) is successful, but how to generalize it to unknown test environments remains challenging. Existing methods focus on training an RL policy that is universal across changing visual domains, whereas we focus on extracting a universal visual foreground, feeding clean, invariant vision to the RL policy learner. Our method is completely unsupervised, requiring no manual annotations or access to environment internals. Given videos of actions in a training environment, we learn how to extract foregrounds with unsupervised keypoint detection, followed by unsupervised visual attention to automatically generate a foreground mask per video frame. We can then introduce artificial distractors and train a model to reconstruct the clean foreground mask from noisy observations. Only this learned model is needed at test time to provide distraction-free visual input to the RL policy learner. Our Visual Attention and Invariance (VAI) method significantly outperforms the state-of-the-art on visual domain generalization, gaining 15 to 49% (61 to 229%) more cumulative rewards per episode on DeepMind Control (our DrawerWorld Manipulation) benchmarks. Our results demonstrate not only that it is possible to learn domain-invariant vision without any supervision, but also that freeing RL from visual distractions makes the policy more focused and thus far better.
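The abstract describes a two-stage idea: obtain clean foreground masks on training frames (via unsupervised keypoints and attention), then corrupt observations with artificial distractors and train a model to recover the clean mask, so only that model is needed at test time. The following is a minimal PyTorch sketch of the second stage only, under assumptions not stated in the abstract: the clean masks are given as tensors (the keypoint/attention stage is abstracted away), and the distractor augmentation (`add_distractors`), network (`MaskNet`), and training step are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Predicts a per-pixel foreground mask from a (possibly distracted) frame."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

def add_distractors(frames):
    """Hypothetical distractor augmentation: overlay a random color patch per frame."""
    b, c, h, w = frames.shape
    noisy = frames.clone()
    for i in range(b):
        y = torch.randint(0, h - 16, (1,)).item()
        x = torch.randint(0, w - 16, (1,)).item()
        noisy[i, :, y:y + 16, x:x + 16] = torch.rand(c, 1, 1)
    return noisy

def train_step(model, optimizer, frames, clean_masks):
    """One step: reconstruct the clean foreground mask from a distracted frame."""
    noisy = add_distractors(frames)
    pred = model(noisy)
    loss = nn.functional.binary_cross_entropy(pred, clean_masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = MaskNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    frames = torch.rand(4, 3, 64, 64)                   # stand-in for training frames
    masks = (torch.rand(4, 1, 64, 64) > 0.8).float()    # stand-in for unsupervised masks
    print(train_step(model, opt, frames, masks))
```

At test time, the trained mask model would be applied to raw observations and its output used to mask out distractions before the frame reaches the RL policy; the policy itself never needs to see the distractors.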