Inferring the 3D locations and shapes of multiple objects from a single 2D image is a long-standing objective of computer vision. Most existing works either predict one of these 3D properties or focus on solving both for a single object. One fundamental challenge lies in learning an effective representation of the image that is well-suited for 3D detection and reconstruction. In this work, we propose to learn a regular grid of 3D voxel features from the input image that is aligned with the 3D scene space via a 3D feature lifting operator. Based on the 3D voxel features, our novel CenterNet-3D detection head formulates 3D detection as keypoint detection in 3D space. Moreover, we devise an efficient coarse-to-fine reconstruction module, including coarse-level voxelization and a novel local PCA-SDF shape representation, which enables fine-detail reconstruction and inference one order of magnitude faster than prior methods. With complementary supervision from both 3D detection and reconstruction, the learned 3D voxel features become geometry- and context-preserving, benefiting both tasks. The effectiveness of our approach is demonstrated through 3D detection and reconstruction in single-object and multiple-object scenarios.
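To make the 3D feature lifting operator concrete, the sketch below shows one common way such lifting can be realized: voxel centers on a regular scene-space grid are projected into the image with pinhole intrinsics, and 2D CNN features are bilinearly sampled at the projections. This is an illustrative assumption, not the authors' implementation; the function name `lift_features`, the intrinsics `K`, and the grid bounds are all hypothetical.

```python
# Minimal sketch of a 3D feature lifting operator (assumed design, not the
# paper's code): project voxel centers into the image and bilinearly sample
# 2D features, yielding a regular grid of 3D voxel features.
import torch
import torch.nn.functional as F


def lift_features(feat2d, K, grid_min, grid_max, res):
    """feat2d: (1, C, H, W) image feature map; K: (3, 3) camera intrinsics.
    grid_min/grid_max: (x, y, z) bounds of the scene-space grid.
    Returns voxel features of shape (1, C, res, res, res)."""
    _, C, H, W = feat2d.shape
    # Regular grid of voxel centers in camera/scene coordinates.
    axes = [torch.linspace(lo, hi, res) for lo, hi in zip(grid_min, grid_max)]
    zs, ys, xs = torch.meshgrid(axes[2], axes[1], axes[0], indexing="ij")
    pts = torch.stack([xs, ys, zs], dim=-1).reshape(-1, 3)       # (res^3, 3)
    # Perspective projection of voxel centers onto the image plane.
    uvw = pts @ K.T
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)                 # pixel coords
    # Normalize to [-1, 1] and bilinearly sample the 2D feature map.
    uv_norm = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                           2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    grid = uv_norm.view(1, res * res * res, 1, 2)
    sampled = F.grid_sample(feat2d, grid, align_corners=True)    # (1, C, N, 1)
    return sampled.view(1, C, res, res, res)
```

Under this construction, every voxel along a camera ray receives the same image feature, so it is the joint 3D detection and reconstruction supervision described above that disambiguates depth and makes the voxel features geometry- and context-preserving.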