Humans have impressive generalization capabilities when it comes to manipulating objects and tools in completely novel environments. These capabilities are, at least partially, a result of humans having internal models of their bodies and of any grasped object. How to learn such body schemas for robots remains an open problem. In this work, we develop a self-supervised approach that extends a robot's kinematic model with a grasped object, using visual latent representations. Our framework comprises two components: (1) we present a multi-modal keypoint detector: an autoencoder architecture trained by fusing proprioception and vision to predict visual keypoints on an object; (2) we show how to use the learned keypoint detector to learn an extension of the kinematic chain by regressing virtual joints from the predicted visual keypoints. Our evaluation shows that our approach consistently predicts visual keypoints on objects in the manipulator's hand, and thus facilitates learning an extended kinematic chain that includes an object grasped in various configurations, from only a few seconds of visual data. Finally, we show that this extended kinematic chain lends itself to object manipulation tasks such as placing a grasped object, and we present experiments both in simulation and on hardware.
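To make the two components of the framework concrete, the following is a minimal, hypothetical sketch in PyTorch: a keypoint detector that fuses an image with joint angles, and a least-squares fit of a constant "virtual link" offset from the end effector to a detected keypoint. All class, function, and parameter names (KeypointDetector, fit_virtual_joint, the layer sizes, and the number of keypoints) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class KeypointDetector(nn.Module):
    """Sketch of a multi-modal keypoint predictor: fuses an RGB image with
    joint angles (proprioception) and regresses K 2D keypoints on the object."""
    def __init__(self, num_keypoints=4, num_joints=7):
        super().__init__()
        self.K = num_keypoints
        self.encoder = nn.Sequential(              # image encoder
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fuse = nn.Sequential(                 # fuse vision + proprioception
            nn.Linear(32 + num_joints, 64), nn.ReLU(),
            nn.Linear(64, 2 * num_keypoints))      # (u, v) per keypoint

    def forward(self, image, joint_angles):
        z = self.encoder(image)
        out = self.fuse(torch.cat([z, joint_angles], dim=-1))
        return out.view(-1, self.K, 2)

def fit_virtual_joint(ee_rotations, ee_positions, keypoints_3d):
    """Least-squares fit of a constant offset o (in the hand frame) such that
    keypoint_t ~= R_t @ o + p_t, i.e. a fixed virtual link extending the chain.
    ee_rotations: (T, 3, 3), ee_positions: (T, 3), keypoints_3d: (T, 3)."""
    A = ee_rotations.reshape(-1, 3)                # stack rotation rows: (3T, 3)
    b = (keypoints_3d - ee_positions).reshape(-1, 1)
    return torch.linalg.lstsq(A, b).solution.squeeze()

# Usage sketch on random data:
# det = KeypointDetector()
# kps = det(torch.rand(1, 3, 64, 64), torch.rand(1, 7))       # (1, 4, 2)
# offset = fit_virtual_joint(torch.eye(3).repeat(10, 1, 1),
#                            torch.zeros(10, 3), torch.rand(10, 3))
```

In this sketch the virtual joint is reduced to a fixed translational offset recovered by linear least squares; the paper's full method, which regresses virtual joints from the predicted keypoints, may use a richer parameterization.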