Despite the growing demand for interactive AI systems, there have been few comprehensive studies on human-AI interaction in visual understanding, e.g., segmentation. Inspired by the development of prompt-based universal interfaces for LLMs, this paper presents SEEM, a promptable, interactive model for Segmenting Everything Everywhere all at once in an image. SEEM has four desiderata: i) Versatility: introducing a versatile prompting engine that handles different types of prompts, including points, boxes, scribbles, masks, texts, and referred regions of another image; ii) Compositionality: learning a joint visual-semantic space for visual and textual prompts, so that queries can be composed on the fly at inference, as shown in Fig. 1; iii) Interactivity: incorporating learnable memory prompts that retain dialog-history information via mask-guided cross-attention; and iv) Semantic-awareness: using a text encoder to encode text queries and mask labels for open-vocabulary segmentation.
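To make the compositionality desideratum concrete, the sketch below shows one way heterogeneous prompts could be projected into a shared visual-semantic space and concatenated into a single query set for a mask decoder. This is a minimal illustration under assumed shapes and module names (`PromptComposer`, `visual_proj`, `text_proj` are hypothetical), not SEEM's released implementation.

```python
# Minimal sketch (hypothetical names, not SEEM's actual code): each prompt
# type is projected into one joint C-dimensional space, so any combination
# of visual and textual prompts can be composed on the fly at inference.
import torch
import torch.nn as nn


class PromptComposer(nn.Module):
    def __init__(self, dim: int = 256, text_dim: int = 512):
        super().__init__()
        # Per-modality projections into the joint visual-semantic space.
        self.visual_proj = nn.Linear(dim, dim)      # points/boxes/scribbles/masks
        self.text_proj = nn.Linear(text_dim, dim)   # text-encoder outputs

    def forward(self, visual_prompts=None, text_prompts=None):
        """Concatenate whichever prompts the user supplied this round."""
        parts = []
        if visual_prompts is not None:              # (B, Nv, dim)
            parts.append(self.visual_proj(visual_prompts))
        if text_prompts is not None:                # (B, Nt, text_dim)
            parts.append(self.text_proj(text_prompts))
        # The decoder sees one flat sequence of queries, so prompt types
        # compose freely without retraining per combination.
        return torch.cat(parts, dim=1)


if __name__ == "__main__":
    composer = PromptComposer()
    queries = composer(visual_prompts=torch.randn(1, 3, 256),
                       text_prompts=torch.randn(1, 1, 512))
    print(queries.shape)  # torch.Size([1, 4, 256])
```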
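The interactivity desideratum hinges on mask-guided cross-attention: memory prompts are refined by attending only to image features inside the previous round's mask. The following is a minimal sketch of that masking pattern using standard PyTorch attention; `MaskGuidedCrossAttention` and its argument names are assumptions for illustration, and the real decoder's details differ.

```python
# Minimal sketch (not SEEM's actual implementation) of mask-guided
# cross-attention for updating memory prompts across interaction rounds.
import torch
import torch.nn as nn


class MaskGuidedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, memory_prompts, image_features, prev_mask):
        """
        memory_prompts: (B, Q, C)  learnable queries carrying dialog history
        image_features: (B, HW, C) flattened image feature map
        prev_mask:      (B, HW)    binary mask from the previous round
        """
        # Positions outside the previous mask are excluded from attention,
        # so the memory prompt is updated only by evidence inside the
        # region the user has been interacting with.
        attn_mask = prev_mask < 0.5  # True = ignore this key position
        # Guard against fully-masked rows, which would produce NaNs.
        attn_mask = attn_mask & ~attn_mask.all(dim=-1, keepdim=True)
        out, _ = self.attn(query=memory_prompts,
                           key=image_features,
                           value=image_features,
                           key_padding_mask=attn_mask)
        return memory_prompts + out  # residual update of the dialog memory


if __name__ == "__main__":
    attn = MaskGuidedCrossAttention(dim=256)
    mem = torch.randn(1, 4, 256)                 # 4 memory prompts
    feats = torch.randn(1, 64 * 64, 256)         # 64x64 features, flattened
    mask = (torch.rand(1, 64 * 64) > 0.7).float()
    print(attn(mem, feats, mask).shape)          # torch.Size([1, 4, 256])
```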