Locating 3D objects from a single RGB image via Perspective-n-Point (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, allowing for partial learning of 2D-3D point correspondences by backpropagating the gradients of pose loss. Yet, learning the entire set of correspondences from scratch is highly challenging, particularly for ambiguous pose solutions, where the globally optimal pose is theoretically non-differentiable w.r.t. the points. In this paper, we propose EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of pose with differentiable probability density on the SE(3) manifold. The 2D-3D coordinates and corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distributions. The underlying principle generalizes previous approaches and resembles the attention mechanism. EPro-PnP can enhance existing correspondence networks, closing the gap between PnP-based methods and the task-specific leaders on the LineMOD 6DoF pose estimation benchmark. Furthermore, EPro-PnP helps to explore new possibilities of network design, as we demonstrate a novel deformable correspondence network with state-of-the-art pose accuracy on the nuScenes 3D object detection benchmark. Our code is available at https://github.com/tjiiv-cprg/EPro-PnP-v2.
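The KL-divergence training principle above can be sketched in a minimal form. When the target pose distribution is a Dirac delta at the ground-truth pose, the KL loss reduces to a negative log-likelihood: the reprojection energy at the ground truth plus the log of a normalizing integral over the pose space. The NumPy toy below is an illustrative sketch, not the paper's implementation: the pose is reduced to a 2D translation, the camera is a hypothetical orthographic shift, and the normalizer is estimated by plain uniform sampling (the paper works on the full SE(3) manifold with Adaptive Multiple Importance Sampling).

```python
import numpy as np

def residuals(pose_t, x3d, x2d, w2d):
    """Weighted reprojection residual f(y) for a toy camera model
    where 'projection' is just a 2D translation of the 3D points'
    x/y coordinates (simplifying assumption; the actual method uses
    a full SE(3) pose with perspective projection)."""
    proj = x3d[:, :2] + pose_t
    return (w2d * (proj - x2d)).reshape(-1)

def monte_carlo_kl_loss(pose_gt, x3d, x2d, w2d,
                        n_samples=4096, half_range=3.0, seed=0):
    """KL loss against a Dirac target, i.e. the negative log-likelihood
    0.5*||f(y_gt)||^2 + log Z, where Z = integral of exp(-0.5*||f(y)||^2)
    over poses y, estimated here by uniform Monte Carlo sampling."""
    rng = np.random.default_rng(seed)
    # energy at the ground-truth pose
    e_gt = 0.5 * np.sum(residuals(pose_gt, x3d, x2d, w2d) ** 2)
    # uniform proposal over a box centered on the target pose
    samples = pose_gt + rng.uniform(-half_range, half_range,
                                    size=(n_samples, 2))
    e = np.array([0.5 * np.sum(residuals(s, x3d, x2d, w2d) ** 2)
                  for s in samples])
    # log Z ~= log(volume) + logsumexp(-e) - log(K), in stable form
    m = (-e).max()
    log_z = (np.log((2.0 * half_range) ** 2)
             + m + np.log(np.exp(-e - m).sum()) - np.log(n_samples))
    return e_gt + log_z
```

Because both terms are smooth in the 2D-3D coordinates and weights, this loss stays differentiable even when the globally optimal pose is ambiguous, which is the key property the layer exploits.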
Title: EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation