A prominent approach to visual Reinforcement Learning (RL) is to learn an internal state representation using self-supervised methods, which can improve sample efficiency and generalization through additional learning signals and inductive biases. However, while the real world is inherently 3D, prior efforts have largely focused on leveraging 2D computer vision techniques as auxiliary self-supervision. In this work, we present a unified framework for self-supervised learning of 3D representations for motor control. Our proposed framework consists of two phases: a pretraining phase, where a deep voxel-based 3D autoencoder is pretrained on a large object-centric dataset, and a finetuning phase, where the representation is finetuned jointly with RL on in-domain data. We empirically show that our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods. Additionally, our learned policies transfer zero-shot to a real robot setup with only approximate geometric correspondence, and successfully solve motor control tasks that involve grasping and lifting using only a single, uncalibrated RGB camera. Code and videos are available at https://yanjieze.com/3d4rl/.