Diffusion models (DMs) have become the new trend in generative modeling and have demonstrated a powerful capability for conditional synthesis. Among these, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable through customizable prompts. Unlike unconditional generative models that focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to vision-language pre-training. In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model for visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and study how to take full advantage of the learned knowledge. Specifically, we prompt the denoising decoder with proper textual inputs and refine the text features with an adapter, leading to better alignment with the pre-training stage and enabling the visual contents to interact with the text prompts. We also propose to utilize the cross-attention maps between the visual features and the text features to provide explicit guidance. Compared with other pre-training methods, we show that vision-language pre-trained diffusion models can be adapted faster to downstream visual perception tasks using the proposed VPD. Extensive experiments on semantic segmentation, referring image segmentation, and depth estimation demonstrate the effectiveness of our method. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring image segmentation, establishing new records on these two benchmarks. Code is available at https://github.com/wl-zhao/VPD
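To make the described pipeline concrete, below is a minimal sketch of the idea: a single pass through a denoising UNet at a fixed timestep, with class prompts refined by a small residual adapter, and cross-attention maps concatenated with the visual features as explicit guidance for a task head. All module names and layouts here (StubVAEEncoder, StubTextEncoder, StubUNet, TextAdapter, the gamma-scaled residual) are our own assumptions that only mimic tensor shapes so the sketch runs without a Stable Diffusion checkpoint; the released code at the repository above uses the actual pre-trained components.

```python
import torch
import torch.nn as nn

# --- Hypothetical stand-ins for the pre-trained Stable Diffusion parts. ---
# In the real pipeline these are the SD VAE encoder, the CLIP text encoder,
# and the denoising UNet; the stubs only reproduce the tensor shapes.

class StubVAEEncoder(nn.Module):
    """Maps an RGB image to a latent at 1/8 resolution (like the SD VAE)."""
    def __init__(self, latent_dim=4):
        super().__init__()
        self.proj = nn.Conv2d(3, latent_dim, kernel_size=8, stride=8)

    def forward(self, x):
        return self.proj(x)

class StubTextEncoder(nn.Module):
    """Returns one embedding per class prompt (stand-in for CLIP)."""
    def __init__(self, num_classes=20, dim=768):
        super().__init__()
        self.emb = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self):
        return self.emb  # (num_classes, dim)

class StubUNet(nn.Module):
    """Denoising-UNet stand-in: yields visual features and the
    cross-attention map between visual tokens and text embeddings."""
    def __init__(self, latent_dim=4, dim=768):
        super().__init__()
        self.visual = nn.Conv2d(latent_dim, dim, 3, padding=1)

    def forward(self, z, t, text_emb):
        v = self.visual(z)                          # (B, C, H, W)
        B, C, H, W = v.shape
        tokens = v.flatten(2).transpose(1, 2)       # (B, HW, C)
        attn = torch.einsum('bnc,kc->bnk', tokens, text_emb)
        attn = attn.softmax(-1).transpose(1, 2).reshape(B, -1, H, W)
        return v, attn                              # features, cross-attn maps

class TextAdapter(nn.Module):
    """Residual MLP that refines the frozen text embeddings; the exact
    adapter layout (gamma-scaled residual) is an assumption."""
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, dim))
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, t):
        return t + self.gamma * self.mlp(t)

class VPDBackbone(nn.Module):
    """One denoising pass at a fixed timestep: the UNet acts purely as a
    feature extractor, no noise is added to the latent."""
    def __init__(self, num_classes=20):
        super().__init__()
        self.vae = StubVAEEncoder()
        self.text = StubTextEncoder(num_classes)
        self.adapter = TextAdapter()
        self.unet = StubUNet()

    def forward(self, image):
        z = self.vae(image)                    # clean latent, no noise
        text_emb = self.adapter(self.text())   # refined class prompts
        t = torch.zeros(image.size(0), dtype=torch.long)  # fixed timestep
        feats, attn = self.unet(z, t, text_emb)
        # Cross-attention maps are concatenated with the visual features
        # to give the task head explicit semantic guidance.
        return torch.cat([feats, attn], dim=1)

model = VPDBackbone(num_classes=20)
head = nn.Conv2d(768 + 20, 20, kernel_size=1)  # e.g. a segmentation head
logits = head(model(torch.randn(2, 3, 64, 64)))
print(logits.shape)  # torch.Size([2, 20, 8, 8])
```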