Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
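The masked attention described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (the actual model uses multi-head cross-attention inside a Transformer decoder, with the mask predicted by the previous decoder layer); it only shows the core idea of suppressing attention logits for pixels outside the predicted mask region before the softmax. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def masked_attention(queries, keys, values, mask):
    """Single-head cross-attention restricted to predicted mask regions.

    queries: (Q, d)  object query embeddings
    keys:    (N, d)  per-pixel key features
    values:  (N, dv) per-pixel value features
    mask:    (Q, N)  boolean, True where the predicted mask is foreground
    """
    d = queries.shape[-1]
    # Standard scaled dot-product cross-attention logits.
    logits = queries @ keys.T / np.sqrt(d)
    # Masked attention: push logits outside the predicted mask toward -inf,
    # so the softmax assigns them (near-)zero weight.
    logits = np.where(mask, logits, -1e9)
    # Numerically stable softmax over the pixel axis.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values
```

Constraining the softmax this way makes each query pool features only from its own predicted region, which is the localization property the abstract attributes to masked attention.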