Click-based interactive image segmentation aims to extract objects with a limited number of user clicks. Hierarchical backbones are the de facto architecture for current methods. Recently, the plain, non-hierarchical Vision Transformer (ViT) has emerged as a competitive backbone for dense prediction tasks. This design allows the original ViT to serve as a foundation model that can be finetuned for downstream tasks without redesigning a hierarchical backbone for pretraining. Although this design is simple and has proven effective, it has not yet been explored for interactive segmentation. To fill this gap, we propose the first plain-backbone method for interactive segmentation, termed SimpleClick due to its architectural simplicity. With the plain backbone pretrained as a masked autoencoder (MAE), SimpleClick achieves state-of-the-art performance without bells and whistles. Remarkably, our method achieves 4.15 NoC@90 on SBD, a 21.8% improvement over the previous best result. Extensive evaluation on medical images highlights the generalizability of our method. We also provide a detailed computational analysis of our method, highlighting its suitability as a practical annotation tool.
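For readers unfamiliar with the NoC@90 metric cited above, the following is a minimal sketch of how NoC (number of clicks) is commonly computed in interactive-segmentation evaluation; it is an illustration under common conventions (an IoU target of 0.90 and a budget of 20 clicks per image), not the paper's exact evaluation code:

```python
# Hedged sketch of the NoC@90 metric: for each image, clicks are simulated
# one at a time until the predicted mask reaches the target IoU (0.90) or a
# click budget (commonly 20) is exhausted; NoC@90 is the mean click count.

def noc_for_image(per_click_ious, target_iou=0.90, max_clicks=20):
    """per_click_ious: IoU of the prediction after click 1, 2, ... for one image."""
    for k, iou in enumerate(per_click_ious[:max_clicks], start=1):
        if iou >= target_iou:
            return k  # reached the target after k clicks
    return max_clicks  # budget exhausted: counted as max_clicks

def mean_noc(dataset_ious, target_iou=0.90, max_clicks=20):
    """Average NoC over a dataset; dataset_ious is one IoU trajectory per image."""
    scores = [noc_for_image(t, target_iou, max_clicks) for t in dataset_ious]
    return sum(scores) / len(scores)
```

Under these conventions, a lower NoC@90 means fewer clicks are needed on average to reach 90% IoU, which is why the reported 4.15 on SBD represents an improvement.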