Despite the growing demand for interactive AI systems, there have been few comprehensive studies on human-AI interaction in visual understanding, e.g., segmentation. Inspired by the development of prompt-based universal interfaces for LLMs, this paper presents SEEM, a promptable, interactive model for Segmenting Everything Everywhere all at once in an image. SEEM is designed with four desiderata: i) Versatility: introducing a versatile prompting engine that handles different types of prompts, including points, boxes, scribbles, masks, texts, and referred regions of another image; ii) Compositionality: learning a joint visual-semantic space for visual and textual prompts, so that queries can be composed on the fly at inference, as shown in Fig. 1; iii) Interactivity: incorporating learnable memory prompts that retain dialog history information via mask-guided cross-attention; and iv) Semantic-awareness: using a text encoder to encode text queries and mask labels for open-vocabulary segmentation.
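To make the compositionality desideratum concrete, below is a minimal PyTorch sketch of how encoded visual prompts (points, boxes, scribbles) and textual prompts might be projected into a joint visual-semantic space and composed on the fly before learnable queries attend to them. The `PromptComposer` class, its dimensions, and the DETR-style learnable queries are illustrative assumptions, not SEEM's actual implementation.

```python
import torch
import torch.nn as nn

class PromptComposer(nn.Module):
    """Minimal sketch of composing visual and textual prompts in a shared
    embedding space, in the spirit of SEEM's compositionality desideratum.
    Module names and dimensions are hypothetical, not SEEM's real code."""

    def __init__(self, dim: int = 256, num_queries: int = 100):
        super().__init__()
        # Learnable segmentation queries (assumption: DETR-style queries).
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # Project visual prompts (e.g., pooled point/box/scribble features)
        # and text prompts (e.g., text-encoder outputs) into one joint space.
        self.visual_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, visual_prompts: torch.Tensor,
                text_prompts: torch.Tensor) -> torch.Tensor:
        # visual_prompts: (B, Nv, dim); text_prompts: (B, Nt, dim).
        # Map both prompt types into the joint visual-semantic space, then
        # concatenate so any mix of prompts can be composed at inference.
        prompts = torch.cat(
            [self.visual_proj(visual_prompts), self.text_proj(text_prompts)],
            dim=1,
        )
        q = self.queries.unsqueeze(0).expand(visual_prompts.size(0), -1, -1)
        # Queries cross-attend to the composed prompt sequence; the refined
        # queries would then be decoded into masks downstream.
        out, _ = self.attn(q, prompts, prompts)
        return out  # (B, num_queries, dim)

# Usage: compose one point prompt with one text prompt for a single image.
composer = PromptComposer()
visual = torch.randn(1, 1, 256)   # one encoded point/scribble prompt
text = torch.randn(1, 1, 256)     # one encoded text prompt
queries = composer(visual, text)  # (1, 100, 256)
```

Projecting both prompt types into one space is what lets prompt types mix freely in a single attention pass; a mask-guided variant of this cross-attention (restricting attention with the previous mask) would similarly underpin the interactivity desideratum.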