Universal Image Segmentation is not a new concept. Attempts to unify image segmentation over the past decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on the ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20K, Cityscapes, and COCO, despite the latter being trained on each of the three tasks individually with three times the resources. With new ConvNeXt and DiNAT backbones, we observe even further performance improvements. We believe OneFormer is a significant step towards making image segmentation more universal and accessible. To support further research, we open-source our code and models at https://github.com/SHI-Labs/OneFormer
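The abstract does not spell out the query-text contrastive loss; the sketch below shows one plausible form, a symmetric InfoNCE-style loss between object-query embeddings and text embeddings, assuming one matching text per query. The function name, temperature value, and pairing scheme are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def query_text_contrastive_loss(queries, texts, temperature=0.07):
    """Symmetric contrastive loss between query and text embeddings.

    queries: (N, D) object-query embeddings; row i pairs with texts[i].
    texts:   (N, D) text embeddings.
    Illustrative sketch -- not the paper's exact loss.
    """
    # L2-normalize both embedding sets so the dot product is cosine similarity
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = texts / np.linalg.norm(texts, axis=1, keepdims=True)
    logits = q @ t.T / temperature  # (N, N) similarity matrix
    n = logits.shape[0]

    def xent(l):
        # softmax cross-entropy with the matching pairs on the diagonal
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # average the query->text and text->query directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Matched query/text pairs drive the diagonal of the similarity matrix up relative to the off-diagonal entries, which is what encourages the inter-task and inter-class separation the abstract refers to.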