In this paper we set out to solve the task of 6-DOF 3D object detection from 2D images, where the only supervision is a geometric representation of the objects we aim to find. In doing so, we remove the need for 6-DOF labels (i.e., position, orientation, etc.), allowing our network to be trained on unlabeled images in a self-supervised manner. We achieve this through a neural network that learns an explicit scene parameterization, which is subsequently passed into a differentiable renderer. We analyze why analysis-by-synthesis-like losses based on differentiable rendering are impractical for supervising 3D scene structure: they almost always get stuck in local minima caused by visual ambiguities. This can be overcome by a novel form of training in which an additional network is employed to steer the optimization itself to explore the entire parameter space, i.e., to be curious, and hence to resolve those ambiguities and find workable minima.
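To make the pipeline concrete, the following is a minimal PyTorch sketch of the kind of training loop described above, not the authors' implementation: all names (SceneParamNet, the toy render stand-in, the curiosity weight of 0.1) are illustrative assumptions. A pose network predicts an explicit 6-DOF parameterization, a differentiable renderer produces an image for an analysis-by-synthesis loss, and a second "curiosity" network rewards pose estimates it cannot yet predict, pushing the optimization out of ambiguous local minima.

```python
import torch
import torch.nn as nn

class SceneParamNet(nn.Module):
    """Predicts an explicit 6-DOF scene parameterization from an image."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.pose_head = nn.Linear(64, 6)  # translation (3) + axis-angle rotation (3)

    def forward(self, img):
        return self.pose_head(self.encoder(img))

# Stand-in for a differentiable renderer; a real system would rasterize the
# known object geometry at the predicted pose (e.g., with PyTorch3D).
toy_render = nn.Linear(6, 3 * 64 * 64)

param_net = SceneParamNet()
curiosity_net = SceneParamNet()  # auxiliary net: tries to predict the pose estimate
opt = torch.optim.Adam(
    list(param_net.parameters()) + list(toy_render.parameters()), lr=1e-4)
cur_opt = torch.optim.Adam(curiosity_net.parameters(), lr=1e-4)

img = torch.rand(8, 3, 64, 64)  # unlabeled training batch

# Analysis-by-synthesis: render the predicted scene and compare to the input.
pose = param_net(img)
recon = toy_render(pose).view(8, 3, 64, 64)
recon_loss = (recon - img).abs().mean()

# Curiosity bonus: reward poses the auxiliary network fails to anticipate,
# encouraging exploration of the parameter space instead of settling into
# the nearest visually ambiguous minimum.
novelty = (curiosity_net(img).detach() - pose).pow(2).mean()
(recon_loss - 0.1 * novelty).backward()
opt.step(); opt.zero_grad()

# The curiosity network itself learns to predict the current pose estimates,
# so the novelty bonus decays for regions that have already been explored.
cur_loss = (curiosity_net(img) - pose.detach()).pow(2).mean()
cur_loss.backward()
cur_opt.step(); cur_opt.zero_grad()
```

The key design choice in this sketch is the sign of the curiosity term: the pose network is driven toward estimates its counterpart cannot yet predict, while the counterpart continually catches up, which is one standard way to operationalize "being curious" during optimization.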