Modern 3D semantic instance segmentation approaches predominantly rely on specialized voting mechanisms followed by carefully designed geometric clustering techniques. Building on the success of recent Transformer-based methods for object detection and image segmentation, we propose the first Transformer-based approach for 3D semantic instance segmentation. We show that generic Transformer building blocks can be leveraged to directly predict instance masks from 3D point clouds. In our model, called Mask3D, each object instance is represented as an instance query. Using Transformer decoders, the instance queries are learned by iteratively attending to point cloud features at multiple scales. Combined with point features, the instance queries directly yield all instance masks in parallel. Mask3D has several advantages over current state-of-the-art approaches: it relies neither on (1) voting schemes that require hand-selected geometric properties (such as centers) nor on (2) geometric grouping mechanisms that require manually tuned hyperparameters (e.g., radii), and (3) it enables a loss that directly optimizes instance masks. Mask3D sets a new state of the art on ScanNet test (+6.2 mAP), S3DIS 6-fold (+10.1 mAP), STPLS3D (+11.2 mAP), and ScanNet200 test (+12.4 mAP).
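To make the statement that instance queries, combined with point features, yield all instance masks in parallel more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each mask is obtained from the dot product between a query embedding and every per-point feature, followed by a sigmoid and a fixed threshold; the abstract does not specify these details. The function name `predict_instance_masks`, the `threshold` parameter, and the toy inputs are hypothetical.

```python
import numpy as np


def predict_instance_masks(queries, point_features, threshold=0.5):
    """Turn K instance queries and N point features into K point masks.

    queries:        array of shape (K, D), one embedding per instance query
    point_features: array of shape (N, D), one embedding per 3D point
    Returns a boolean array of shape (K, N) and the underlying probabilities.

    Assumption: similarity is a plain dot product squashed by a sigmoid.
    """
    # One matrix product scores every (query, point) pair at once,
    # which is what lets all masks be produced in parallel.
    logits = queries @ point_features.T          # shape (K, N)
    probs = 1.0 / (1.0 + np.exp(-logits))        # element-wise sigmoid
    return probs > threshold, probs


# Toy example: 2 queries, 3 points, 2-dimensional features (made-up values).
toy_queries = np.array([[1.0, 0.0],
                        [0.0, 1.0]])
toy_points = np.array([[5.0, -5.0],
                       [-5.0, 5.0],
                       [5.0, 5.0]])

masks, probs = predict_instance_masks(toy_queries, toy_points)
print(masks)
```

In this toy setup the first query selects the points whose first feature component is positive and the second query selects those whose second component is positive, so the third point belongs to both masks. Because the masks come from a differentiable score, a loss can in principle be applied to them directly, which is the kind of mask-level optimization the abstract refers to.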