Humans can easily imagine the complete 3D geometry of occluded objects and scenes. This appealing ability is vital for recognition and understanding. To enable this capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images. Our framework adopts a two-stage design: we start from a sparse set of visible and occupied voxel queries obtained from depth estimation, followed by a densification stage that generates dense 3D voxels from the sparse ones. A key idea of this design is that the visual features in 2D images correspond only to the visible scene structures rather than the occluded or empty spaces. Therefore, starting with the featurization and prediction of the visible structures is more reliable. Once we obtain the set of sparse queries, we apply a masked autoencoder design to propagate the information to all the voxels by self-attention. Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics, while reducing GPU memory during training by ~45% to less than 16GB. Our code is available at https://github.com/NVlabs/VoxFormer.
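To make the two-stage design concrete, below is a minimal PyTorch sketch (not the authors' implementation) of the densification idea described above: features of the sparse visible-and-occupied voxel queries are scattered into a dense grid, unobserved voxels are filled with a learnable mask token in the masked-autoencoder style, and self-attention propagates information to every voxel before per-voxel semantic prediction. All names (`DensificationStage`), shapes, and hyperparameters here are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn

class DensificationStage(nn.Module):
    """MAE-style densification sketch: sparse visible-voxel features in,
    dense per-voxel semantic logits out. Hypothetical, for illustration."""

    def __init__(self, embed_dim=128, num_classes=20, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable token standing in for unobserved (occluded/empty) voxels.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)  # per-voxel semantics

    def forward(self, sparse_feats, sparse_idx, num_voxels):
        # sparse_feats: (B, M, C) features of the M visible, occupied voxels
        # sparse_idx:   (B, M) flattened grid indices of those voxels
        # num_voxels:   total number N of voxels in the dense grid
        B, M, C = sparse_feats.shape
        # Start every voxel from the mask token, then overwrite the
        # visible ones with their query features.
        dense = self.mask_token.expand(B, num_voxels, C).clone()
        dense.scatter_(1, sparse_idx.unsqueeze(-1).expand(-1, -1, C), sparse_feats)
        dense = self.encoder(dense)   # self-attention over all voxels
        return self.head(dense)       # (B, N, num_classes) logits
```

Note that at SemanticKITTI scale (hundreds of thousands of voxels), full self-attention over the dense grid would be prohibitively expensive; it is used here only to make the information flow explicit, and a practical implementation would rely on an efficient attention variant.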