Object-centric representation is an essential abstraction for forward prediction. Most existing forward models learn this representation through extensive supervision (e.g., object classes and bounding boxes), even though such ground-truth information is rarely accessible in practice. To address this, we introduce KINet (Keypoint Interaction Network), an end-to-end unsupervised framework that reasons about object interactions based on a keypoint representation. From visual observations, our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system as a set of keypoint embeddings and their relations. It then learns an action-conditioned forward model, trained with contrastive estimation, to predict future keypoint states. By performing physical reasoning in keypoint space, our model generalizes automatically to scenarios with different numbers of objects, novel backgrounds, and unseen object geometries. Experiments demonstrate that our model accurately performs forward prediction and learns plannable object-centric representations that can be used in downstream robotic manipulation tasks.
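A minimal sketch of this pipeline, assuming a spatial soft-argmax keypoint detector, a fully connected graph with one step of message passing, and an InfoNCE-style contrastive objective; all module names, dimensions, and the specific loss form here are illustrative assumptions, not the exact KINet architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointDetector(nn.Module):
    """Maps an image to K keypoint coordinates via a spatial soft-argmax."""
    def __init__(self, k=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, k, 5, stride=2, padding=2),
        )

    def forward(self, img):                          # img: (B, 3, H, W)
        heat = self.conv(img)                        # per-keypoint heatmaps (B, K, h, w)
        B, K, h, w = heat.shape
        prob = F.softmax(heat.view(B, K, -1), dim=-1).view(B, K, h, w)
        ys = torch.linspace(-1, 1, h, device=img.device)
        xs = torch.linspace(-1, 1, w, device=img.device)
        y = (prob.sum(3) * ys).sum(2)                # expected row coordinate
        x = (prob.sum(2) * xs).sum(2)                # expected column coordinate
        return torch.stack([x, y], dim=-1)           # (B, K, 2)

class ForwardModel(nn.Module):
    """Action-conditioned prediction over a fully connected keypoint graph."""
    def __init__(self, dim=64, action_dim=4):
        super().__init__()
        self.embed = nn.Linear(2, dim)               # keypoint coords -> node embedding
        self.edge = nn.Linear(2 * dim, dim)          # pairwise relation encoder
        self.node = nn.Linear(2 * dim + action_dim, dim)

    def forward(self, kp, action):                   # kp: (B, K, 2), action: (B, A)
        h = self.embed(kp)
        B, K, D = h.shape
        hi = h.unsqueeze(2).expand(B, K, K, D)       # receiver copies
        hj = h.unsqueeze(1).expand(B, K, K, D)       # sender copies
        msg = self.edge(torch.cat([hi, hj], dim=-1)).sum(2)  # aggregate relations
        a = action.unsqueeze(1).expand(B, K, action.shape[-1])
        return self.node(torch.cat([h, msg, a], dim=-1))  # predicted next-state embeddings

def contrastive_loss(pred, target, temperature=0.1):
    """InfoNCE: the prediction should match the true next state (positive)
    rather than the next states of other batch elements (negatives)."""
    pred = F.normalize(pred.flatten(1), dim=-1)
    target = F.normalize(target.flatten(1), dim=-1)
    logits = pred @ target.t() / temperature         # (B, B) similarity matrix
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)

# Usage: predict the next keypoint state from the current frame and an action,
# using the next frame's keypoint embeddings as the contrastive positives.
det, model = KeypointDetector(k=8), ForwardModel(dim=64, action_dim=4)
img_t, img_t1 = torch.randn(16, 3, 64, 64), torch.randn(16, 3, 64, 64)
action = torch.randn(16, 4)
loss = contrastive_loss(model(det(img_t), action), model.embed(det(img_t1)))
loss.backward()
```

Because both prediction and matching operate on per-keypoint embeddings rather than fixed-size object slots, nothing in this sketch depends on the number of objects in the scene, which is what allows the learned model to transfer to scenes with more or fewer objects.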