Locating 3D objects from a single RGB image via Perspective-n-Point (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, allowing the 2D-3D point correspondences to be partially learned by backpropagating the gradients of the pose loss. Yet, learning the entire set of correspondences from scratch is highly challenging, particularly for ambiguous pose solutions, where the globally optimal pose is theoretically non-differentiable w.r.t. the points. In this paper, we propose EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of pose with differentiable probability density on the SE(3) manifold. The 2D-3D coordinates and the corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distributions. The underlying principle generalizes previous approaches and resembles the attention mechanism. EPro-PnP can enhance existing correspondence networks, closing the gap between PnP-based methods and the task-specific leaders on the LineMOD 6DoF pose estimation benchmark. Furthermore, EPro-PnP helps to explore new possibilities of network design, as we demonstrate a novel deformable correspondence network with state-of-the-art pose accuracy on the nuScenes 3D object detection benchmark. Our code is available at https://github.com/tjiiv-cprg/EPro-PnP-v2.
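To make the KL-based training objective concrete, below is a minimal sketch in plain Python. All function names are hypothetical, and the pose is simplified to a toy 2-DoF planar translation with an identity "camera" so the example stays self-contained; the actual EPro-PnP layer operates on full SE(3) poses with perspective projection and uses importance sampling rather than the naive grid of samples shown here.

```python
import math

def log_density(pose, x3d, x2d, w2d):
    # Unnormalized log-density of a pose: -1/2 times the sum of squared
    # weighted reprojection errors. Hypothetical toy setup: pose is a
    # planar translation (tx, ty) and "projection" just shifts the XY
    # coordinates of each 3D point.
    s = 0.0
    for (X, Y, _), (u, v), (wu, wv) in zip(x3d, x2d, w2d):
        s += (wu * (X + pose[0] - u)) ** 2 + (wv * (Y + pose[1] - v)) ** 2
    return -0.5 * s

def kl_pose_loss(pose_gt, pose_samples, x3d, x2d, w2d):
    # Monte Carlo estimate of the KL-derived loss (up to a constant):
    # negative log-density at the ground-truth pose, plus the log of the
    # normalizing integral approximated by log-mean-exp over pose samples.
    logp_gt = log_density(pose_gt, x3d, x2d, w2d)
    logps = [log_density(p, x3d, x2d, w2d) for p in pose_samples]
    m = max(logps)
    log_norm = m + math.log(sum(math.exp(lp - m) for lp in logps) / len(logps))
    return -logp_gt + log_norm
```

Minimizing this loss w.r.t. the 2D-3D coordinates and weights sharpens the predicted density around the ground-truth pose, which is how the correspondences become learnable intermediate variables.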