Modern 3D semantic instance segmentation approaches predominantly rely on specialized voting mechanisms followed by carefully designed geometric clustering techniques. Building on the success of recent Transformer-based methods for object detection and image segmentation, we propose the first Transformer-based approach for 3D semantic instance segmentation. We show that we can leverage generic Transformer building blocks to directly predict instance masks from 3D point clouds. In our model, called Mask3D, each object instance is represented as an instance query. Using Transformer decoders, the instance queries are learned by iteratively attending to point cloud features at multiple scales. Combined with point features, the instance queries directly yield all instance masks in parallel. Mask3D has several advantages over current state-of-the-art approaches: it neither relies on (1) voting schemes, which require hand-selected geometric properties (such as centers), nor (2) geometric grouping mechanisms requiring manually tuned hyper-parameters (e.g., radii), and (3) it enables a loss that directly optimizes instance masks. Mask3D sets a new state-of-the-art on ScanNet test (+6.2 mAP), S3DIS 6-fold (+10.1 mAP), STPLS3D (+11.2 mAP) and ScanNet200 test (+12.4 mAP).
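The core mechanism described above, instance queries refined by cross-attention over point features, then turned into per-point masks via dot products, can be sketched as follows. This is a minimal illustrative toy in NumPy, not the paper's implementation: all sizes, the number of decoder iterations, and the single-scale, projection-free attention are simplifying assumptions (Mask3D uses learned projections, multi-scale features, and a full Transformer decoder).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions, not the paper's configuration)
num_points, num_queries, dim = 1000, 8, 32

point_feats = rng.standard_normal((num_points, dim))  # backbone point features (N, D)
queries = rng.standard_normal((num_queries, dim))     # learned instance queries (Q, D)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    """One simplified cross-attention step: queries attend to point features.

    Omits the learned query/key/value projections a real decoder layer has.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    attn = softmax((q @ kv.T) * scale, axis=-1)  # (Q, N) attention weights
    return attn @ kv                             # attended features (Q, D)

# Iteratively refine the queries (stand-in for stacked Transformer decoder layers)
for _ in range(3):
    queries = queries + cross_attention(queries, point_feats)

# Each refined query yields a per-point mask in parallel:
# dot product with every point feature, then a sigmoid threshold.
mask_logits = queries @ point_feats.T                  # (Q, N)
masks = 1.0 / (1.0 + np.exp(-mask_logits)) > 0.5       # (Q, N) binary instance masks

print(masks.shape)
```

Note how no voting, center regression, or radius-based grouping appears anywhere: each query directly produces a full instance mask, which is what allows a loss defined directly on masks.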