We propose a simple, efficient, yet powerful framework for dense visual prediction based on the conditional diffusion pipeline. Our approach follows a "noise-to-map" generative paradigm: it produces a prediction by progressively removing noise from a random Gaussian distribution, guided by the image. The method, called DDP, efficiently extends the denoising diffusion process into the modern perception pipeline. Without task-specific design or architecture customization, DDP generalizes easily to most dense prediction tasks, e.g., semantic segmentation and depth estimation. In addition, in contrast to previous single-step discriminative methods, DDP shows attractive properties such as dynamic inference and uncertainty awareness. We show top results on three representative tasks with six diverse benchmarks. Without tricks, DDP achieves state-of-the-art or competitive performance on each task compared to the specialist counterparts: 83.9 mIoU on Cityscapes for semantic segmentation, 70.6 mIoU on nuScenes for BEV map segmentation, and 0.05 REL on KITTI for depth estimation. We hope that our approach will serve as a solid baseline and facilitate future research.
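To make the "noise-to-map" idea concrete, the following is a minimal sketch of an image-conditioned denoising loop. It is not DDP's implementation: `toy_denoiser` is a hypothetical stand-in for a learned map decoder, the cosine schedule and the deterministic DDIM-style update are assumptions chosen for brevity, and `image_feat` stands for whatever features an image encoder would produce.

```python
import numpy as np


def cosine_alpha_bar(t):
    """Cumulative signal level for t in [0, 1] (assumed cosine schedule)."""
    return np.cos(0.5 * np.pi * t) ** 2


def toy_denoiser(noisy_map, image_feat, t):
    """Hypothetical stand-in for a learned decoder that predicts the clean map.

    A real model would be a network conditioned on image features and the
    timestep t; this toy version ignores t and mostly follows the conditioning.
    """
    return np.clip(0.9 * image_feat + 0.1 * noisy_map, -1.0, 1.0)


def noise_to_map(image_feat, steps, rng):
    """Start from Gaussian noise and denoise it step by step, guided by image_feat."""
    x = rng.standard_normal(image_feat.shape)  # pure noise at t = 1
    times = np.linspace(1.0, 0.0, steps + 1)
    for t_now, t_next in zip(times[:-1], times[1:]):
        ab_now, ab_next = cosine_alpha_bar(t_now), cosine_alpha_bar(t_next)
        x0_pred = toy_denoiser(x, image_feat, t_now)
        # Noise implied by the current sample and the predicted clean map.
        eps_pred = (x - np.sqrt(ab_now) * x0_pred) / np.sqrt(1.0 - ab_now)
        # Deterministic DDIM-style move to the next (less noisy) level.
        x = np.sqrt(ab_next) * x0_pred + np.sqrt(1.0 - ab_next) * eps_pred
    return x


def sample_with_uncertainty(image_feat, steps, num_samples, seed=0):
    """Draw several maps from different noise seeds; return their mean and std."""
    rng = np.random.default_rng(seed)
    samples = np.stack(
        [noise_to_map(image_feat, steps, rng) for _ in range(num_samples)]
    )
    return samples.mean(axis=0), samples.std(axis=0)
```

In this sketch, `steps` can be chosen at inference time, which is one way to read "dynamic inference", and the per-pixel standard deviation across samples is one simple proxy for "uncertainty awareness". Both readings are illustrative assumptions, not the paper's stated procedure.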