By learning from large corpora of data, pre-trained models have achieved impressive progress in recent years. As a popular form of generative pre-training, diffusion models capture both low-level visual knowledge and high-level semantic relations. In this paper, we propose to exploit such knowledgeable diffusion models for mainstream discriminative tasks, namely unsupervised object discovery: saliency segmentation and object localization. Challenges remain, however: a structural difference between generative and discriminative models prevents direct use, and the lack of explicitly labeled data significantly limits performance in the unsupervised setting. To tackle these issues, we introduce DiffusionSeg, a novel two-stage synthesis-exploitation framework. In the first, synthesis stage, we alleviate data insufficiency by synthesizing abundant images, and propose a novel training-free method, AttentionCut, to obtain their masks. In the second, exploitation stage, we bridge the structural gap by using the inversion technique to map a given image back to diffusion features, which can then be consumed directly by downstream architectures. Extensive experiments and ablation studies demonstrate the superiority of adapting diffusion models for unsupervised object discovery.
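Since the abstract only names the inversion technique, the following is a minimal sketch of the standard DDIM inversion update it presumably refers to: the noise predicted at step t is reused to step a clean image deterministically toward higher noise levels, and the intermediate latents (or, with a real diffusion U-Net, its internal activations) can serve as features for downstream heads. `ToyEpsNet`, the beta schedule, and the step count are illustrative stand-ins, not the paper's actual model or configuration.

```python
# Minimal DDIM-inversion sketch, assuming the common formulation
# x_{t+1} = sqrt(a_{t+1}) * x0_pred + sqrt(1 - a_{t+1}) * eps.
import torch
import torch.nn as nn

class ToyEpsNet(nn.Module):
    """Stand-in for a pre-trained noise-prediction network eps_theta(x_t, t)."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Broadcast the normalized timestep as an extra input channel.
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x.shape[2:])
        return self.net(torch.cat([x, t_map], dim=1))

@torch.no_grad()
def ddim_invert(eps_model: nn.Module, x0: torch.Tensor, num_steps: int = 50):
    """Deterministically map a clean image x0 toward a noised latent x_T,
    collecting the latent trajectory as 'diffusion features' along the way."""
    # Linear beta schedule and cumulative alphas (illustrative values).
    betas = torch.linspace(1e-4, 2e-2, num_steps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    x = x0
    features = []  # one latent per inversion step
    for t in range(num_steps - 1):
        t_norm = torch.full((x.shape[0],), t / num_steps)
        eps = eps_model(x, t_norm)
        a_t, a_next = alphas_bar[t], alphas_bar[t + 1]
        # Predict x0 from x_t, then re-noise to the next (higher) noise level.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
        features.append(x)
    return x, features

# Usage: invert a dummy image and inspect the recovered latent trajectory.
model = ToyEpsNet()
image = torch.randn(1, 3, 64, 64)  # placeholder for a real input image
x_T, feats = ddim_invert(model, image)
print(x_T.shape, len(feats))  # torch.Size([1, 3, 64, 64]), 49
```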