We formulate monocular depth estimation using denoising diffusion models, inspired by their recent successes in high-fidelity image generation. To that end, we introduce innovations to address problems arising from noisy, incomplete depth maps in training data, including step-unrolled denoising diffusion, an $L_1$ loss, and depth infilling during training. To cope with the limited availability of data for supervised training, we leverage pre-training on self-supervised image-to-image translation tasks. Despite the simplicity of the approach, with a generic loss and architecture, our DepthGen model achieves SOTA performance on the indoor NYU dataset, and near-SOTA results on the outdoor KITTI dataset. Further, with a multimodal posterior, DepthGen naturally represents depth ambiguity (e.g., from transparent surfaces), and its zero-shot performance, combined with depth imputation, enables a simple but effective text-to-3D pipeline. Project page: https://depth-gen.github.io
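To make the training signal described above concrete, the following is a minimal PyTorch sketch of one diffusion training step with an $L_1$ loss on noise prediction and a simple hole-filling stand-in for depth infilling. This is not the authors' code: the function names, the iterative-averaging infill, and the $\epsilon$-prediction parameterization are illustrative assumptions, and step-unrolled denoising and the architecture are omitted.

```python
import torch
import torch.nn.functional as F

def infill_holes(depth, mask, iters=64):
    """Crude hole filling by iterative 3x3 averaging of valid neighbors.
    A stand-in (assumption) for the paper's depth infilling; depth and
    mask are (B, 1, H, W), mask is 1.0 at pixels with valid ground truth."""
    filled, m = depth * mask, mask.clone()
    kernel = torch.ones(1, 1, 3, 3, device=depth.device)
    for _ in range(iters):
        if m.min() > 0:  # no holes remain
            break
        s = F.conv2d(filled, kernel, padding=1)   # sum of neighbor values
        c = F.conv2d(m, kernel, padding=1)        # count of valid neighbors
        new = (c > 0) & (m == 0)                  # holes adjacent to valid pixels
        filled = torch.where(new, s / c.clamp(min=1), filled)
        m = torch.where(new, torch.ones_like(m), m)
    return filled

def diffusion_depth_loss(model, image, depth, mask, t, alpha_bar):
    """One hedged training step: forward-diffuse the (infilled) depth map,
    predict the noise conditioned on the RGB image, and take an L1 loss
    restricted to pixels with valid ground truth."""
    y0 = infill_holes(depth, mask)  # keep noise statistics valid in holes

    # Forward diffusion: y_t = sqrt(abar_t) * y_0 + sqrt(1 - abar_t) * eps
    eps = torch.randn_like(y0)
    abar = alpha_bar[t].view(-1, 1, 1, 1)
    y_t = abar.sqrt() * y0 + (1.0 - abar).sqrt() * eps

    # L1 noise-prediction loss, averaged over valid pixels only
    eps_hat = model(image, y_t, t)
    per_pixel = F.l1_loss(eps_hat, eps, reduction="none")
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1)
```

The masked loss means the infilled values never receive gradient directly; they only serve to keep the noisy input $y_t$ well-formed where ground truth is missing, which is one plausible reading of how infilling interacts with training.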