Recent advances in unsupervised learning for object detection, segmentation, and tracking hold significant promise for applications in robotics. A common approach is to frame these tasks as inference in probabilistic latent-variable models. In this paper, however, we show that the current state of the art struggles with visually complex scenes such as those typically encountered in robot manipulation tasks. We propose APEX, a new latent-variable model that is able to segment and track objects in more realistic scenes featuring objects that vary widely in size and texture, including the robot arm itself. This is achieved by a principled mask-normalisation algorithm and a high-resolution scene encoder. To evaluate our approach, we present results on the real-world Sketchy dataset. This dataset, however, does not contain the ground-truth masks and object IDs needed for a quantitative evaluation. We therefore introduce the Panda Pushing Dataset (P2D), which shows a Panda arm interacting with objects on a table in simulation and which includes ground-truth segmentation masks and object IDs for tracking. In both cases, APEX comprehensively outperforms the current state of the art in unsupervised object segmentation and tracking. We further demonstrate the efficacy of our segmentations for robot skill execution on an object arrangement task, where APEX achieves the best or comparable performance among all baselines.