Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Pretrained backbones with fine-tuning have been widely adopted in 2D vision and natural language processing tasks and have demonstrated significant advantages over task-specific networks. In this paper, we present a pretrained 3D backbone, named Swin3D, which is the first to outperform all state-of-the-art methods on downstream 3D indoor scene understanding tasks. Our backbone network is based on a 3D Swin transformer and is carefully designed to conduct self-attention efficiently on sparse voxels with linear memory complexity and to capture the irregularity of point signals via a generalized contextual relative positional embedding. Based on this backbone design, we pretrained a large Swin3D model on the synthetic Structured3D dataset, which is 10 times larger than the ScanNet dataset, and fine-tuned the pretrained model on various downstream real-world indoor scene understanding tasks. The results demonstrate that our model pretrained on the synthetic dataset not only generalizes well to both downstream segmentation and detection on real 3D point datasets, but also surpasses the state-of-the-art methods on downstream tasks after fine-tuning, with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, +2.1 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. Our method demonstrates the great potential of pretrained 3D backbones with fine-tuning for 3D understanding tasks. The code and models are available at https://github.com/microsoft/Swin3D.
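To make the idea of window-based self-attention on sparse voxels more concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes integer voxel coordinates, non-overlapping cubic windows, a single attention head, and no learned projections; the function names `window_ids` and `windowed_self_attention` are hypothetical. It materializes a full attention matrix per window, so it illustrates only the window partition and does not reproduce the linear-memory attention or the contextual relative positional embedding described above.

```python
import numpy as np


def window_ids(coords, window_size):
    """Assign each integer voxel coordinate the id of its cubic window."""
    cells = np.asarray(coords) // window_size  # (N, 3) window index per axis
    # One integer id per distinct occupied window; empty windows never appear.
    _, ids = np.unique(cells, axis=0, return_inverse=True)
    return ids.reshape(-1)


def windowed_self_attention(feats, coords, window_size):
    """Single-head self-attention restricted to voxels sharing a window.

    Illustrative assumption: queries, keys and values are the raw features
    (a real backbone would apply learned linear projections first).
    """
    feats = np.asarray(feats, dtype=np.float64)
    ids = window_ids(coords, window_size)
    out = np.zeros_like(feats)
    scale = 1.0 / np.sqrt(feats.shape[1])
    for w in np.unique(ids):
        idx = np.nonzero(ids == w)[0]      # voxels that fall in this window
        x = feats[idx]                     # (k, C) features of those voxels
        logits = (x @ x.T) * scale         # (k, k) scaled dot products
        logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
        weights = np.exp(logits)
        weights /= weights.sum(axis=1, keepdims=True)
        out[idx] = weights @ x             # attention-weighted average
    return out
```

Because attention is computed only among occupied voxels inside the same window, each attention matrix is at most as large as the window capacity squared, regardless of the total number of voxels in the scene.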
