Predictive models have been at the core of many robotic systems, from quadrotors to walking robots. However, it has been challenging to develop and apply such models to practical robotic manipulation due to high-dimensional sensory observations such as images. Previous approaches to learning models in the context of robotic manipulation have either learned whole-image dynamics or used autoencoders to learn dynamics in a low-dimensional latent state. In this work, we introduce model-based prediction with self-supervised visual correspondence learning, and show that this is not only possible but that these predictive models deliver compelling performance improvements over alternative methods for vision-based RL that use autoencoder-type vision training. Through simulation experiments, we demonstrate that our models generalize with better precision, particularly in 3D scenes, scenes involving occlusion, and across object categories. Additionally, we validate through hardware experiments that our method transfers effectively to the real world. Videos and supplementary materials are available at https://sites.google.com/view/keypointsintothefuture
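To make the core idea concrete, here is a minimal sketch of what a keypoint-based predictive model might look like: rather than predicting full images or an autoencoder latent, a learned dynamics model is rolled forward over a small set of 3D keypoints obtained from self-supervised visual correspondence. All names, sizes, and the MLP architecture below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class KeypointDynamicsModel(nn.Module):
    """Sketch: predicts next-step 3D keypoint positions given an action.

    Hypothetical architecture; the keypoints themselves are assumed to come
    from a separately trained self-supervised correspondence model.
    """

    def __init__(self, num_keypoints: int = 8, action_dim: int = 4, hidden: int = 128):
        super().__init__()
        self.num_keypoints = num_keypoints
        in_dim = num_keypoints * 3 + action_dim  # flattened keypoints + action
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_keypoints * 3),
        )

    def forward(self, keypoints: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # keypoints: (B, K, 3) positions from the correspondence model
        # action:    (B, action_dim)
        b = keypoints.shape[0]
        x = torch.cat([keypoints.reshape(b, -1), action], dim=-1)
        delta = self.net(x).reshape(b, self.num_keypoints, 3)
        return keypoints + delta  # residual prediction of the next keypoints


def rollout(model: KeypointDynamicsModel,
            keypoints: torch.Tensor,
            actions: list[torch.Tensor]) -> torch.Tensor:
    """Multi-step rollout, e.g. for scoring candidate action sequences in MPC."""
    preds = []
    z = keypoints
    for a in actions:  # each a: (B, action_dim)
        z = model(z, a)
        preds.append(z)
    return torch.stack(preds)  # (T, B, K, 3)
```

A model of this form operates on a compact, physically meaningful state (keypoint positions) rather than pixels or an opaque latent code, which is one plausible reading of why such models could be more precise in 3D and occluded scenes.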