We introduce a novel robotic system for improving unseen object instance segmentation in the real world by leveraging long-term robot interaction with objects. Previous approaches either grasp or push an object and then obtain the segmentation mask of the grasped or pushed object after a single action. Instead, our system defers the decision on segmenting objects until after a sequence of robot pushing actions. By applying multi-object tracking and video object segmentation to the images collected during robot pushing, our system can generate segmentation masks for all the objects in these images in a self-supervised way. These include images in which objects are very close to each other, on which existing object segmentation networks usually make segmentation errors. We demonstrate the usefulness of our system by fine-tuning segmentation networks trained on synthetic data with real-world data collected by our system. We show that, after fine-tuning, the segmentation accuracy of the networks is significantly improved both within the same domain and across different domains. In addition, we verify that the fine-tuned networks improve top-down robotic grasping of unseen objects in the real world.