We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have shown a remarkable capability to generate high-quality images from diverse open-vocabulary language descriptions. This demonstrates that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models like CLIP, on the other hand, are good at classifying images into open-vocabulary labels. We propose to leverage the frozen representations of both of these models to perform panoptic segmentation of any category in the wild. Our approach outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, our method achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, an absolute improvement of 8.3 PQ and 7.9 mIoU over the previous state of the art. The project page is available at \url{https://jerryxu.net/ODISE}.
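To make the high-level idea concrete, the sketch below illustrates open-vocabulary mask classification with frozen features in PyTorch. It is a minimal, illustrative sketch rather than the authors' implementation: it assumes diffusion features have already been projected into CLIP's embedding space, and the helpers `diffusion_features`, `mask_generator`, and `clip_text_embed` are hypothetical stand-ins for the frozen diffusion backbone, a class-agnostic mask proposal head, and CLIP's text encoder.

```python
# Minimal sketch: classify class-agnostic masks against an open vocabulary
# by matching pooled frozen-diffusion features to CLIP text embeddings.
# All helper functions are hypothetical stand-ins, not the ODISE API.
import torch
import torch.nn.functional as F

C, H, W = 512, 64, 64  # assumed shared embedding width and feature resolution

def diffusion_features(image):
    """Stand-in for the frozen diffusion model's dense internal features,
    assumed already projected to CLIP's embedding width C."""
    return torch.randn(C, H, W)

def mask_generator(feats):
    """Stand-in for a class-agnostic mask head producing N soft masks."""
    return torch.rand(100, H, W)

def clip_text_embed(class_names):
    """Stand-in for CLIP's frozen text encoder over the label vocabulary."""
    return torch.randn(len(class_names), C)

def classify_masks(image, class_names):
    feats = diffusion_features(image)                    # (C, H, W)
    masks = mask_generator(feats)                        # (N, H, W)
    # Average the frozen features inside each mask -> one embedding per mask.
    mask_embeds = torch.einsum("nhw,chw->nc", masks, feats)
    mask_embeds = mask_embeds / masks.sum(dim=(1, 2)).clamp(min=1e-6)[:, None]
    text_embeds = clip_text_embed(class_names)           # (K, C)
    # Cosine similarity between each mask embedding and each label embedding.
    return F.normalize(mask_embeds, dim=-1) @ F.normalize(text_embeds, dim=-1).T

logits = classify_masks(torch.zeros(3, 512, 512), ["cat", "traffic light", "sky"])
print(logits.shape)  # torch.Size([100, 3]) -- one score per mask per label
```

Because the label set enters only through the text encoder, the vocabulary can be changed at inference time without retraining, which is what enables segmentation of categories never seen during COCO training.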