Humans can easily imagine the complete 3D geometry of occluded objects and scenes. This appealing ability is vital for recognition and understanding. To enable such capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images. Our framework adopts a two-stage design where we start from a sparse set of visible and occupied voxel queries from depth estimation, followed by a densification stage that generates dense 3D voxels from the sparse ones. A key idea of this design is that the visual features on 2D images correspond only to the visible scene structures rather than the occluded or empty spaces. Therefore, starting with the featurization and prediction of the visible structures is more reliable. Once we obtain the set of sparse queries, we apply a masked autoencoder design to propagate the information to all the voxels by self-attention. Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics, while reducing GPU memory during training to less than 16GB. Our code is available at https://github.com/NVlabs/VoxFormer.
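To make the two-stage idea concrete, below is a minimal PyTorch sketch (not the official implementation) of the design described above: stage 1 keeps voxel queries only at locations that a depth estimate marks as visible and occupied, and stage 2 densifies by assigning every remaining voxel a learned mask token (a masked-autoencoder-style design) and letting self-attention propagate information to all voxels. All module names, tensor shapes, and the toy grid size are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class TwoStageCompletionSketch(nn.Module):
    """Toy sketch of a sparse-query-then-densify completion model."""

    def __init__(self, num_voxels=512, dim=64, num_classes=20):
        super().__init__()
        self.num_voxels = num_voxels
        # Learned embedding per voxel position, used as the query for
        # voxels that stage 1 marks as visible and occupied.
        self.voxel_queries = nn.Parameter(torch.randn(num_voxels, dim))
        # Learned mask token shared by all voxels not selected in stage 1.
        self.mask_token = nn.Parameter(torch.zeros(1, dim))
        # Stand-in for image-feature aggregation: here just an MLP that
        # "featurizes" the selected queries.
        self.featurize = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Self-attention layers that spread information from the sparse,
        # featurized voxels to every voxel in the grid (stage 2).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.densify = nn.TransformerEncoder(layer, num_layers=2)
        # Per-voxel semantic head (one class could denote "empty").
        self.head = nn.Linear(dim, num_classes)

    def forward(self, occupied_mask):
        # occupied_mask: (num_voxels,) bool tensor from depth back-projection,
        # True where the depth estimate says the voxel is visible and occupied.
        x = self.mask_token.expand(self.num_voxels, -1).clone()
        # Stage 1: only visible/occupied voxels get real (featurized) queries.
        x[occupied_mask] = self.featurize(self.voxel_queries[occupied_mask])
        # Stage 2: self-attention over all voxels densifies the sparse signal.
        x = self.densify(x.unsqueeze(0)).squeeze(0)
        return self.head(x)  # (num_voxels, num_classes) semantic logits


if __name__ == "__main__":
    model = TwoStageCompletionSketch()
    # Pretend ~10% of voxels were marked occupied by a depth estimate.
    occupied = torch.rand(512) < 0.1
    logits = model(occupied)
    print(logits.shape)  # torch.Size([512, 20])
```

The sketch only illustrates the flow of information; the actual framework featurizes the sparse queries from 2D image features and operates on a full 3D voxel grid, as described in the repository linked above.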