We present \ourmodel{}, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets. To bridge the gap in vocabulary and annotation granularity, we first introduce a pre-trained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them. This gives us reasonably good results compared with counterparts trained on the segmentation task only. To further reconcile the two tasks, we identify two discrepancies: $i$) task discrepancy -- segmentation requires extracting masks for both foreground objects and background stuff, while detection is concerned only with the former; $ii$) data discrepancy -- box and mask annotations have different spatial granularities and are thus not directly interchangeable. To address these issues, we propose a decoupled decoding to reduce the interference between foreground and background, and a conditioned mask decoding to assist in generating masks for given boxes. Building on these three techniques, we develop a simple encoder-decoder model and train it jointly on COCO and Objects365. After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection. Specifically, \ourmodel{} beats the state-of-the-art method for open-vocabulary instance and panoptic segmentation across 5 datasets, and outperforms previous work for open-vocabulary detection on LVIS and ODinW under similar settings. When transferred to specific tasks, our model achieves new SoTA for panoptic segmentation on COCO and ADE20K, and instance segmentation on ADE20K and Cityscapes. Finally, we note that \ourmodel{} is the first to explore the potential of joint training on segmentation and detection, and we hope it can serve as a strong baseline for developing a single model for both tasks in the open world.