This work proposes a self-supervised learning system for segmenting rigid objects in RGB images. The proposed pipeline is trained on unlabeled RGB-D videos of static objects, which can be captured with a camera carried by a mobile robot. A key feature of the self-supervised training process is a graph-matching algorithm that operates on the over-segmented point cloud reconstructed from each video. The graph matching, together with point cloud registration, finds recurring object patterns across videos and combines them into 3D object pseudo labels, even under occlusion or changes in viewing angle. The 2D object masks projected from the 3D pseudo labels are used to train a pixel-wise feature extractor through contrastive learning. During online inference, a clustering method uses the learned features to group foreground pixels into object segments. Experiments highlight the method's effectiveness on both real and synthetic video datasets, which include cluttered scenes of tabletop objects. The proposed method outperforms existing unsupervised methods for object segmentation by a large margin.
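To make the training signal concrete, the sketch below shows one way the projected 2D pseudo-label masks could drive pixel-wise contrastive learning: pixels sharing a pseudo-object id are treated as positives, all other sampled foreground pixels as negatives. This is a minimal PyTorch sketch, not the authors' implementation; the function name, the sampling scheme, and the hyperparameters (n_samples, temperature) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(features, pseudo_masks, n_samples=1024, temperature=0.1):
    """features: (B, C, H, W) per-pixel embeddings from the feature extractor.
    pseudo_masks: (B, H, W) integer pseudo-object ids (0 = background)."""
    B, C, H, W = features.shape
    feats = features.permute(0, 2, 3, 1).reshape(-1, C)   # (B*H*W, C)
    labels = pseudo_masks.reshape(-1)                     # (B*H*W,)

    # Sample foreground pixels only; background pixels carry no pseudo label.
    fg = torch.nonzero(labels > 0, as_tuple=False).squeeze(1)
    idx = fg[torch.randperm(fg.numel(), device=fg.device)[:n_samples]]
    z = F.normalize(feats[idx], dim=1)                    # unit-norm embeddings
    y = labels[idx]

    sim = z @ z.t() / temperature                         # pairwise cosine similarity
    pos = (y.unsqueeze(0) == y.unsqueeze(1)).float()      # same pseudo-object -> positive
    pos.fill_diagonal_(0)                                 # exclude self-pairs

    # Supervised-contrastive (InfoNCE-style) loss over the sampled pixels.
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()
    exp = torch.exp(logits)
    exp = exp * (1 - torch.eye(len(y), device=z.device))  # mask self in the denominator
    log_prob = logits - torch.log(exp.sum(dim=1, keepdim=True) + 1e-8)
    mean_pos = (pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return -mean_pos.mean()
```

Pixel sampling keeps the similarity matrix small (n_samples × n_samples) regardless of image resolution, which is a common practical choice for dense contrastive objectives.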
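The online inference step can likewise be sketched: foreground pixel embeddings are clustered into object segments. The snippet below is an assumed illustration using DBSCAN as a stand-in for whatever clustering method the paper employs; eps and min_samples are placeholder values, not reported settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def segment_objects(features, fg_mask, eps=0.3, min_samples=50):
    """features: (C, H, W) per-pixel embeddings; fg_mask: (H, W) bool foreground mask.
    Returns an (H, W) segment map where 0 is background/noise and 1..K are objects."""
    C, H, W = features.shape
    coords = np.argwhere(fg_mask)                      # (N, 2) foreground pixel coords
    feats = features[:, fg_mask].T                     # (N, C) their embeddings
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)

    # Density-based clustering in feature space; nearby embeddings form one object.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)

    seg = np.zeros((H, W), dtype=np.int32)
    for (r, c), lab in zip(coords, labels):
        if lab >= 0:                                   # -1 marks DBSCAN noise points
            seg[r, c] = lab + 1
    return seg
```

A density-based clusterer is a natural fit here because the number of objects per scene is unknown at inference time, so methods that require a preset cluster count would be harder to apply.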