We present a technique for segmenting real and AI-generated images using latent diffusion models (LDMs) trained on internet-scale datasets. First, we show that the latent space of LDMs (z-space) is a better input representation than alternatives such as RGB images or CLIP encodings for text-based image segmentation. Because the latent z-space forms a compressed representation shared across several domains, such as different forms of art, cartoons, illustrations, and photographs, training segmentation models on it also bridges the domain gap between real and AI-generated images. We then show that the internal features of LDMs contain rich semantic information and present LD-ZNet, a technique that exploits these features to further boost text-based segmentation performance. Overall, we show up to 6% improvement over standard baselines for text-to-image segmentation on natural images. For AI-generated imagery, we show close to 20% improvement over state-of-the-art techniques.
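To make the first idea concrete, here is a minimal sketch (not the authors' implementation) of segmenting from z-space latents: an image is encoded with a pretrained Stable Diffusion VAE and the compressed latents are fed to a small segmentation network. It assumes the Hugging Face diffusers library's AutoencoderKL; the checkpoint name and SegmentationHead are illustrative, and the text-prompt conditioning and LDM internal features that LD-ZNet additionally uses are omitted for brevity.

```python
# Sketch: train a segmentation model on LDM z-space latents instead of RGB.
# Assumptions: diffusers' AutoencoderKL, an SD 1.x checkpoint, and a
# hypothetical SegmentationHead; the real method also conditions on text.
import torch
import torch.nn as nn
from diffusers import AutoencoderKL

# Pretrained Stable Diffusion VAE (assumed checkpoint), frozen for encoding.
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
).eval()


class SegmentationHead(nn.Module):
    """Hypothetical head mapping 4-channel z-space latents to mask logits."""

    def __init__(self, in_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),  # per-pixel binary mask logits
            # z-space is 8x downsampled, so upsample back to pixel resolution
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


head = SegmentationHead()

# Images normalized to [-1, 1], shape (B, 3, H, W) with H, W divisible by 8.
images = torch.randn(1, 3, 512, 512)

with torch.no_grad():
    # z-space latents: (B, 4, H/8, W/8); 0.18215 is SD 1.x's latent scale.
    z = vae.encode(images).latent_dist.sample() * 0.18215

mask_logits = head(z)  # (B, 1, 512, 512); train with a BCE or Dice loss
```

Because the VAE was trained on internet-scale data spanning photographs, art, and illustrations, the same frozen encoder serves both real and AI-generated inputs, which is what lets a head trained this way transfer across the two domains.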