Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient robot learning from visual observations. Yet current approaches typically train a single model end-to-end to learn both visual representations and dynamics, making it difficult to accurately model the interaction between robots and small objects. In this work, we introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning. Specifically, we train an autoencoder with convolutional layers and vision transformers (ViT) to reconstruct pixels given masked convolutional features, and learn a latent dynamics model that operates on the representations from the autoencoder. Moreover, to encode task-relevant information, we introduce an auxiliary reward prediction objective for the autoencoder. We continually update both the autoencoder and the dynamics model using online samples collected from environment interaction. We demonstrate that our decoupling approach achieves state-of-the-art performance on a variety of visual robotic tasks from Meta-world and RLBench, e.g., we achieve an 81.7% success rate on 50 visual robotic manipulation tasks from Meta-world, while the baseline achieves 67.9%. Code is available on the project website: https://sites.google.com/view/mwm-rl.
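To make the decoupled design concrete, below is a minimal, illustrative sketch of the autoencoder component, assuming PyTorch. All module names, layer sizes, and the masking ratio are hypothetical and simplified (e.g., only visible tokens are decoded here); it is not the authors' released implementation. The latent dynamics model would then operate on the `latent` representation produced by the encoder.

```python
# A simplified sketch of a masked conv-feature autoencoder with a ViT encoder,
# a pixel reconstruction head, and an auxiliary reward prediction head.
import torch
import torch.nn as nn


class MaskedConvViTAutoencoder(nn.Module):
    """Conv stem -> mask conv-feature tokens -> ViT -> pixel + reward heads."""

    def __init__(self, img_size=64, embed_dim=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Convolutional stem: 64x64x3 image -> 8x8 grid of feature vectors.
        self.conv_stem = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, embed_dim, 4, stride=2, padding=1),
        )
        self.num_tokens = (img_size // 8) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_tokens, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.vit_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Each token reconstructs the 8x8 RGB image region it corresponds to.
        self.pixel_head = nn.Linear(embed_dim, 8 * 8 * 3)
        # Auxiliary reward prediction encourages task-relevant features.
        self.reward_head = nn.Linear(embed_dim, 1)

    def forward(self, images):
        feats = self.conv_stem(images)                 # (B, D, 8, 8)
        tokens = feats.flatten(2).transpose(1, 2)      # (B, N, D)
        tokens = tokens + self.pos_embed
        # Randomly keep a subset of convolutional feature tokens (masking).
        num_keep = int(self.num_tokens * (1 - self.mask_ratio))
        scores = torch.rand(tokens.size(0), self.num_tokens, device=tokens.device)
        keep = scores.argsort(dim=1)[:, :num_keep]     # indices of visible tokens
        visible = torch.gather(
            tokens, 1, keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        )
        latent = self.vit_encoder(visible)             # representation for the dynamics model
        pixels = self.pixel_head(latent)               # per-token pixel reconstruction
        reward = self.reward_head(latent.mean(dim=1))  # auxiliary reward prediction
        return latent, pixels, reward
```

In this sketch, representation learning (reconstruction and reward losses on the autoencoder) is kept separate from dynamics learning, which would consume the frozen or detached `latent` features, reflecting the decoupling described in the abstract.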