InterFormer: Real-time Interactive Image Segmentation

Interactive image segmentation enables annotators to efficiently perform pixel-level annotation for segmentation tasks. However, the existing interactive segmentation pipeline computes inefficiently for two reasons. First, each click an annotator makes depends on the model's feedback on the previous click, and this serial interaction cannot exploit the model's parallelism. Second, at every interaction step the model must reprocess the image, the annotator's current click, and its own feedback on the annotator's earlier clicks, which results in redundant computation. To make computation efficient, we propose InterFormer, a method that follows a new pipeline to address both issues. InterFormer extracts the computationally expensive part of the existing process, i.e., image processing, and performs it ahead of time as preprocessing. Specifically, InterFormer employs a large vision transformer (ViT) on high-performance devices to preprocess images in parallel, and then uses a lightweight module called interactive multi-head self-attention (I-MSA) for interactive segmentation. Furthermore, because the I-MSA module can be deployed on low-power devices, it extends the practical application of interactive segmentation. The I-MSA module uses the preprocessed features to respond efficiently to annotator inputs in real time. Experiments on several datasets demonstrate the effectiveness of InterFormer, which outperforms previous interactive segmentation models in both computational efficiency and segmentation quality and achieves real-time, high-quality interactive segmentation on CPU-only devices.
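
The split described above (encode the image once, then run only a cheap step per click) can be illustrated with a minimal sketch. This is not the authors' implementation: `encode_image` is a hypothetical stand-in that replaces the ViT with a fixed random patch projection, and `interactive_step` is a toy single-head attention from every patch to the clicked patches, not the actual I-MSA module. All function names, the patch size, and the feature dimension are assumptions chosen for illustration.

```python
import numpy as np


def encode_image(image, patch=16, dim=32, seed=0):
    """Hypothetical stand-in for the heavy encoder; runs once per image."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # Cut the image into non-overlapping patches and flatten each patch.
    patches = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    # A fixed random projection plays the role of the learned encoder here.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((patches.shape[1], dim)) / np.sqrt(patches.shape[1])
    return patches @ proj, (gh, gw)


def interactive_step(features, grid, clicks, patch=16):
    """Toy per-click step that reuses cached features (not the real I-MSA).

    clicks is a list of (y, x, is_positive) tuples in pixel coordinates.
    """
    gh, gw = grid
    # Map each click to the index of the patch that contains it.
    idx = np.array([(y // patch) * gw + (x // patch) for y, x, _ in clicks])
    sign = np.array([1.0 if pos else -1.0 for _, _, pos in clicks])
    # Every patch attends to the clicked patches (scaled dot-product attention).
    scores = features @ features[idx].T / np.sqrt(features.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    # Positive clicks vote +1 and negative clicks vote -1 for each patch.
    logits = attn @ sign
    return (logits > 0).reshape(grid)


# Offline, on a fast machine: encode once and cache the features.
image = np.ones((64, 64, 3))
image[:, 32:] = -1.0
features, grid = encode_image(image)

# Online, per click: only the lightweight step runs; the image is not re-encoded.
clicks = [(10, 10, True)]
mask = interactive_step(features, grid, clicks)
clicks.append((10, 50, False))
mask = interactive_step(features, grid, clicks)
```

In this sketch the cost of each interaction depends only on the number of patches and clicks, since the image features are computed once and cached. That is the property the abstract relies on for running the interactive part on CPU-only devices.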
