Most state-of-the-art instance segmentation methods rely on large amounts of pixel-precise ground-truth annotations for training, which are expensive to create. Interactive segmentation networks help generate such annotations from an image and corresponding user interactions, such as clicks. Existing methods for this task can only process a single instance at a time, and each user interaction requires a full forward pass through the entire deep network. We introduce a more efficient approach, called DynaMITe, in which we represent user interactions as spatio-temporal queries to a Transformer decoder with the potential to segment multiple object instances in a single iteration. Our architecture also eliminates the need to re-compute image features during refinement, and requires fewer interactions than existing methods for segmenting multiple instances in a single image. DynaMITe achieves state-of-the-art results on multiple existing interactive segmentation benchmarks, as well as on the new multi-instance benchmark that we propose in this paper.
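To make the efficiency claim concrete, the following is a minimal sketch of the interaction pattern the abstract describes, not the authors' implementation: all names (`InteractiveRefiner`, `encode_image`, `refine`) are invented for illustration, and the click encoding as (x, y, time step, positive/negative label) is an assumption. What it demonstrates is the separation the abstract highlights: the heavy backbone runs once per image, while each refinement round only re-runs a lightweight Transformer decoder over the accumulated click queries.

```python
import torch
import torch.nn as nn


class InteractiveRefiner(nn.Module):
    """Hypothetical sketch of a DynaMITe-style refinement loop: the backbone
    runs once, and clicks become queries to a Transformer decoder that is
    re-run cheaply on every interaction round."""

    def __init__(self, backbone: nn.Module, dim: int = 256, num_layers: int = 3):
        super().__init__()
        self.backbone = backbone  # heavy feature extractor, run once per image
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Assumed click encoding: (x, y, time step, positive/negative) -> query.
        self.click_embed = nn.Linear(4, dim)

    @torch.no_grad()
    def encode_image(self, image: torch.Tensor) -> torch.Tensor:
        # Run the expensive backbone exactly once and cache the token sequence.
        return self.backbone(image)  # (B, HW, dim)

    def refine(self, feats: torch.Tensor, clicks: torch.Tensor) -> torch.Tensor:
        # clicks: (B, N, 4) spatio-temporal queries accumulated over all rounds;
        # each refinement pass touches only this lightweight decoder.
        queries = self.click_embed(clicks)      # (B, N, dim)
        refined = self.decoder(queries, feats)  # cross-attend to cached features
        # One mask logit map per query via dot product with the image tokens.
        return torch.einsum("bnd,bld->bnl", refined, feats)


if __name__ == "__main__":
    # Dummy backbone standing in for a real feature extractor.
    class DummyBackbone(nn.Module):
        def forward(self, img: torch.Tensor) -> torch.Tensor:
            return torch.randn(img.shape[0], 64 * 64, 256)

    model = InteractiveRefiner(DummyBackbone())
    feats = model.encode_image(torch.randn(1, 3, 1024, 1024))  # once per image
    clicks = torch.rand(1, 5, 4)         # 5 clicks gathered across rounds
    masks = model.refine(feats, clicks)  # cheap per-interaction pass
    print(masks.shape)                   # torch.Size([1, 5, 4096])
```

Because `encode_image` is invoked once and only `refine` runs per interaction round, the per-click latency is decoupled from the backbone's depth, which is the efficiency property the abstract claims.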