Locating 3D objects from a single RGB image via Perspective-n-Points (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, so that 2D-3D point correspondences can be partly learned by backpropagating the gradient w.r.t. object pose. Yet, learning the entire set of unrestricted 2D-3D points from scratch fails to converge with existing approaches, since the deterministic pose is inherently non-differentiable. In this paper, we propose EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of pose on the SE(3) manifold, essentially bringing categorical Softmax to the continuous domain. The 2D-3D coordinates and corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distributions. The underlying principle unifies the existing approaches and resembles the attention mechanism. EPro-PnP significantly outperforms competitive baselines, closing the gap between PnP-based methods and the task-specific leaders on the LineMOD 6DoF pose estimation and nuScenes 3D object detection benchmarks.
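The idea of bringing categorical Softmax to the continuous domain can be illustrated on a finite set of candidate poses. The following is a minimal sketch, not the authors' implementation. It assumes that the pose likelihood is proportional to exp(-1/2 × weighted squared reprojection error), that the pose varies only in yaw with a fixed translation, and that the integral over SE(3) can be replaced by a sum over candidates. All function names, the camera intrinsics, and the weight values are hypothetical and chosen for illustration only.

```python
import numpy as np

def project(points_3d, R, t, K):
    """Pinhole projection of object-frame points under pose (R, t)."""
    cam = points_3d @ R.T + t            # (N, 3) points in the camera frame
    uv = cam @ K.T                       # homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]        # (N, 2) pixel coordinates

def reprojection_cost(points_3d, points_2d, weights, R, t, K):
    """Sum of squared weighted reprojection errors for one pose."""
    residual = weights * (project(points_3d, R, t, K) - points_2d)
    return float(np.sum(residual ** 2))

def yaw_rotation(theta):
    """Rotation about the camera's y axis (illustrative 1-DoF pose)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def pose_log_probs(points_3d, points_2d, weights, candidates, K):
    """Log-softmax of -cost/2 over a finite set of candidate poses."""
    logits = np.array([
        -0.5 * reprojection_cost(points_3d, points_2d, weights, R, t, K)
        for R, t in candidates
    ])
    m = logits.max()                     # subtract the max for numerical stability
    return logits - (m + np.log(np.sum(np.exp(logits - m))))

def kl_loss(log_probs, gt_index):
    """KL divergence to a one-hot target reduces to negative log-likelihood."""
    return float(-log_probs[gt_index])

# Toy data: assumed intrinsics, random object points, yaw-only candidate poses.
rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
points_3d = rng.uniform(-1.0, 1.0, size=(8, 3))
t = np.array([0.0, 0.0, 5.0])            # keeps every point in front of the camera
yaws = np.linspace(-np.pi / 4, np.pi / 4, 9)
candidates = [(yaw_rotation(y), t) for y in yaws]

gt_index = 6                             # ground-truth pose is one of the candidates
points_2d = project(points_3d, candidates[gt_index][0], t, K)
weights = np.full((8, 2), 0.02)          # stand-in for learned per-point weights

log_probs = pose_log_probs(points_3d, points_2d, weights, candidates, K)
loss = kl_loss(log_probs, gt_index)
print(np.exp(log_probs).round(3), loss)
```

In a learned setting, `points_3d`, `points_2d`, and `weights` would be network outputs, and the gradient of `loss` with respect to them would flow through every candidate pose rather than through a single deterministic solution.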