Humans learn to better understand the world by moving around their environment to obtain more informative viewpoints of the scene. Most methods for 2D visual recognition tasks, such as object detection and segmentation, treat images of the same scene as individual samples and do not exploit object permanence across multiple views. Generalizing to novel scenes and views thus requires additional training with large amounts of human annotation. In this paper, we propose a self-supervised framework that improves an object detector in unseen scenarios by moving an agent around a 3D environment and aggregating multi-view RGB-D information. We unproject confident 2D detections from the pre-trained detector and perform unsupervised 3D segmentation on the resulting point cloud. The segmented 3D objects are then re-projected to all other views to obtain pseudo-labels for fine-tuning. Experiments on both indoor and outdoor datasets show that (1) our framework performs high-quality 3D segmentation from raw RGB-D data and a pre-trained 2D detector; (2) fine-tuning with self-supervision significantly improves the 2D detector when an unseen RGB image is given as input at test time; and (3) training a 3D detector with self-supervision outperforms a comparable self-supervised method by a large margin.
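The unproject / re-project step described above can be illustrated with a minimal sketch: lift the pixels inside a confident 2D detection into a world-frame point cloud using the depth map and camera pose, then project that 3D segment into another view to form a 2D pseudo-label box. All function names, the pinhole intrinsics K, and the camera-to-world pose convention here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of unprojecting a confident 2D detection and re-projecting
# the resulting 3D points into another view as a pseudo-label box.
import numpy as np

def unproject_box(depth, box, K, T_cam2world, score, score_thresh=0.9):
    """Lift pixels inside a confident 2D detection into a world-frame point cloud."""
    if score < score_thresh:            # keep only confident detections
        return None
    x0, y0, x1, y1 = [int(v) for v in box]
    ys, xs = np.mgrid[y0:y1, x0:x1]     # pixel grid inside the box
    z = depth[ys, xs]
    valid = z > 0                       # drop pixels with missing depth
    xs, ys, z = xs[valid], ys[valid], z[valid]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pts_cam = np.stack([(xs - cx) * z / fx, (ys - cy) * z / fy, z], axis=1)
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (T_cam2world @ pts_h.T).T[:, :3]   # points in world coordinates

def reproject_to_box(points_world, K, T_cam2world, image_shape):
    """Project a segmented 3D object into another view and return a 2D pseudo-label box."""
    T_world2cam = np.linalg.inv(T_cam2world)
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    pts_cam = (T_world2cam @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0]      # keep points in front of the camera
    if len(pts_cam) == 0:
        return None
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]               # perspective division
    h, w = image_shape
    u = np.clip(uv[:, 0], 0, w - 1)
    v = np.clip(uv[:, 1], 0, h - 1)
    return np.array([u.min(), v.min(), u.max(), v.max()])  # tight 2D box
```

In the full pipeline, the unprojected points from several views would first be grouped by an unsupervised 3D segmentation step before re-projection; the sketch omits that step and simply maps one detection's points into another camera.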