We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have the remarkable ability to generate high-quality images with diverse open-vocabulary language descriptions. This demonstrates that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models like CLIP, on the other hand, are good at classifying images into open-vocabulary labels. We leverage the frozen internal representations of both these models to perform panoptic segmentation of any category in the wild. Our approach outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, our method achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, an absolute improvement of 8.3 PQ and 7.9 mIoU over the previous state of the art. We open-source our code and models at https://github.com/NVlabs/ODISE.
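To make the open-vocabulary classification step concrete, the sketch below shows how pooled per-mask features can be matched against CLIP text embeddings of arbitrary category names via cosine similarity. This is a minimal illustration under stated assumptions, not ODISE's actual API: the `classify_masks` helper, the tensor shapes, and the random placeholder embeddings are all hypothetical; the real implementation is in the linked repository.

```python
import torch
import torch.nn.functional as F

def classify_masks(mask_embeds: torch.Tensor,
                   text_embeds: torch.Tensor,
                   temperature: float = 0.01) -> torch.Tensor:
    """Assign each mask proposal one of the open-vocabulary categories.

    mask_embeds: (num_masks, dim)   -- per-mask features pooled from the
                                       frozen backbone, projected into
                                       CLIP's joint embedding space
    text_embeds: (num_classes, dim) -- CLIP text embeddings of the
                                       candidate category names
    Returns: (num_masks,) predicted class indices.
    """
    # Normalize so the dot product below is cosine similarity.
    mask_embeds = F.normalize(mask_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = mask_embeds @ text_embeds.t() / temperature
    return logits.argmax(dim=-1)

# Illustrative usage with random placeholders standing in for real embeddings.
masks = torch.randn(100, 512)          # e.g. 100 mask proposals
texts = torch.randn(150, 512)          # e.g. text embeddings for ADE20K's 150 classes
labels = classify_masks(masks, texts)  # (100,) open-vocabulary label indices
```

Because the category list enters only through the text embeddings, the same trained model can be queried with an arbitrary vocabulary at test time, which is what enables zero-shot transfer from COCO training to ADE20K evaluation.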