Object-centric representation is an essential abstraction for physical reasoning and forward prediction. Most existing approaches learn such representations through extensive supervision (e.g., object class and bounding box), although this ground-truth information is rarely available in practice. To address this, we introduce KINet (Keypoint Interaction Network), an end-to-end unsupervised framework that reasons about object interactions in complex systems based on a keypoint representation. From visual observations, our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system as a set of keypoint embeddings and their relations. It then learns an action-conditioned forward model using contrastive estimation to predict future keypoint states. By performing physical reasoning in keypoint space, our model automatically generalizes to scenarios with different numbers of objects and novel object geometries. Experiments demonstrate that our model accurately performs forward prediction and learns plannable object-centric representations that can also be used in downstream model-based control tasks.
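To make the pipeline concrete, below is a minimal sketch, assuming a PyTorch setup: a CNN keypoint detector with soft-argmax, one round of message passing over a fully connected keypoint graph conditioned on the action, and an InfoNCE-style contrastive objective. All module names, network sizes, and the specific loss formulation are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of the described pipeline, assuming PyTorch.
# Architectures, sizes, and the InfoNCE-style loss are illustrative
# assumptions, not the authors' exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointDetector(nn.Module):
    """Maps an image to K keypoint coordinates via per-keypoint heatmaps
    followed by a differentiable soft-argmax (no keypoint supervision)."""
    def __init__(self, num_keypoints=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, num_keypoints, 1),  # one heatmap per keypoint
        )

    def forward(self, img):                        # img: (B, 3, H, W)
        heat = self.conv(img)                      # (B, K, h, w)
        b, k, h, w = heat.shape
        probs = F.softmax(heat.view(b, k, -1), -1).view(b, k, h, w)
        ys = torch.linspace(-1.0, 1.0, h, device=img.device)
        xs = torch.linspace(-1.0, 1.0, w, device=img.device)
        y = (probs.sum(3) * ys).sum(2)             # expected row, (B, K)
        x = (probs.sum(2) * xs).sum(2)             # expected col, (B, K)
        return torch.stack([x, y], -1)             # keypoints, (B, K, 2)

class ForwardModel(nn.Module):
    """One round of message passing over a fully connected keypoint graph,
    conditioned on the action, predicting a displacement per keypoint."""
    def __init__(self, action_dim=2, hidden=64):
        super().__init__()
        self.edge = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden))
        self.node = nn.Sequential(nn.Linear(2 + hidden + action_dim, hidden),
                                  nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, kp, action):                 # kp: (B, K, 2)
        b, k, _ = kp.shape
        recv = kp.unsqueeze(2).expand(b, k, k, 2)  # receiver coordinates
        send = kp.unsqueeze(1).expand(b, k, k, 2)  # sender coordinates
        msg = self.edge(torch.cat([recv, send], -1)).sum(2)  # (B, K, hidden)
        act = action.unsqueeze(1).expand(b, k, action.shape[-1])
        return kp + self.node(torch.cat([kp, msg, act], -1))

def contrastive_loss(pred_kp, next_kp, temperature=0.1):
    """InfoNCE over the batch: each prediction should be nearest to its own
    observed next keypoint state; other batch items serve as negatives."""
    pred, tgt = pred_kp.flatten(1), next_kp.flatten(1)
    logits = -torch.cdist(pred, tgt) / temperature       # (B, B) similarities
    labels = torch.arange(pred.shape[0], device=pred.device)
    return F.cross_entropy(logits, labels)

# Usage on random data (stand-in for consecutive frames and an action):
detector, dynamics = KeypointDetector(), ForwardModel()
img_t, img_t1 = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
action = torch.randn(4, 2)
loss = contrastive_loss(dynamics(detector(img_t), action), detector(img_t1))
loss.backward()
```

Because both the detector and the forward model operate only on keypoint coordinates rather than object identities, a model of this shape can, in principle, be applied unchanged when the number or geometry of objects varies.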