Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts written in natural language. While these models have numerous benefits across various sectors, they have also raised concerns about the potential misuse of fake images and placed new demands on fake image detection. In this work, we pioneer a systematic study of the authenticity of fake images generated by state-of-the-art diffusion models. First, we conduct a comprehensive study of the performance of contrastive and classification-based visual features. Our analysis shows that fake images share common low-level cues, which make them easily recognizable. Further, we devise a multimodal setting in which fake images are synthesized from different textual captions, used as seeds for a generator. Under this setting, we quantify the performance of fake detection strategies and introduce a contrastive-based disentangling strategy that lets us analyze the role of the semantics of textual descriptions and that of low-level perceptual cues. Finally, we release a new dataset, called COCOFake, containing about 600k images generated from original COCO images.
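To make the evaluated detection recipe concrete, the sketch below illustrates (it is not the authors' released code) one plausible instance of probing contrastive visual features for real-vs-fake classification: a frozen CLIP backbone (loaded here via the open_clip library, an assumed choice, with ViT-B-32 as an illustrative architecture) feeds a small linear probe trained to separate original COCO images from their diffusion-generated counterparts.

```python
# Minimal sketch, assuming torch and open_clip are installed; not the paper's
# implementation. Frozen contrastive features + a linear real/fake probe.
import torch
import torch.nn as nn
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen contrastive feature extractor; ViT-B-32 is an illustrative choice.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
model = model.to(device).eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    """Map a batch of preprocessed images to L2-normalized CLIP embeddings."""
    feats = model.encode_image(images.to(device))
    return feats / feats.norm(dim=-1, keepdim=True)

# Linear probe on top of the frozen features: 2 classes (real vs. generated).
probe = nn.Linear(model.visual.output_dim, 2).to(device)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step; labels are 0 for real images, 1 for fakes."""
    logits = probe(extract_features(images))
    loss = criterion(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the backbone stays frozen, the probe's accuracy isolates how much real/fake-discriminative signal the contrastive features already carry, which is the kind of comparison the feature study above performs.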