Most recent 6D object pose estimation methods, including unsupervised ones, require many real training images. Unfortunately, for some applications, such as those in space or deep underwater, acquiring real images, even unannotated ones, is virtually impossible. In this paper, we propose a method that can be trained solely on synthetic images, or optionally with a few additional real ones. Given a rough pose estimate obtained from a first network, it uses a second network to predict a dense 2D correspondence field between the image rendered using the rough pose and the real input image, and infers the required pose correction from it. This approach is much less sensitive to the domain shift between synthetic and real images than state-of-the-art methods. It performs on par with methods that require annotated real images for training when not using any, and outperforms them considerably when using as few as twenty real images.
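To make the refinement step concrete, the sketch below shows one way the pose correction can be recovered once the second network has produced dense correspondences: each rendered pixel carries a known 3D object point (from the render's depth or coordinate buffer), its predicted match in the real image gives a 2D observation, and a standard PnP solver yields the corrected pose. This is a minimal illustrative sketch, not the authors' implementation; the function `refine_pose` and all variable names are assumptions.

```python
# Minimal sketch (assumption, not the paper's code): turn dense 2D
# correspondences between a rendering and the real image into a pose
# correction via PnP. Requires numpy and opencv-python.
import numpy as np
import cv2


def refine_pose(obj_points, matched_pixels, K):
    """Recover a corrected pose from 2D-3D correspondences.

    obj_points     : (N, 3) object-space 3D points visible in the rendering,
                     one per correspondence.
    matched_pixels : (N, 2) locations of the same points in the real image,
                     obtained by following the predicted correspondence field.
    K              : (3, 3) camera intrinsics.
    Returns (R, t) with R a 3x3 rotation matrix and t a 3-vector.
    """
    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        obj_points.astype(np.float64),
        matched_pixels.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
    )
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec.ravel()


# Toy check with synthetic data: project random object points with a known
# pose and verify that the solver recovers it from perfect correspondences.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = np.array([[600.0, 0.0, 320.0],
                  [0.0, 600.0, 240.0],
                  [0.0, 0.0, 1.0]])
    obj = rng.uniform(-0.05, 0.05, size=(200, 3))          # points on the object
    R_gt, _ = cv2.Rodrigues(np.array([0.1, -0.2, 0.05]))    # ground-truth rotation
    t_gt = np.array([0.02, -0.01, 0.5])                     # ground-truth translation
    cam = obj @ R_gt.T + t_gt                                # points in camera frame
    pix = cam @ K.T
    pix = pix[:, :2] / pix[:, 2:3]                           # perspective projection
    R, t = refine_pose(obj, pix, K)
    print("rotation error:", np.max(np.abs(R - R_gt)))
    print("translation error:", np.max(np.abs(t - t_gt)))
```

In the actual method the correspondences come from a network comparing the rendered and real images, so they are noisy; a robust solver (RANSAC, as above) or an iterative render-and-compare loop would typically be used rather than a single exact solve.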