Removing soft and self shadows that lack clear boundaries from a single image remains challenging. Self shadows are shadows cast by an object onto itself. Most existing methods rely on binary shadow masks and do not account for the ambiguous boundaries of soft and self shadows. In this paper, we present DeS3, a method that removes hard, soft, and self shadows based on self-tuned ViT feature similarity and color convergence. Our novel ViT similarity loss uses features extracted from a pre-trained Vision Transformer; it guides the reverse diffusion process towards recovering scene structures. We also introduce a color convergence loss that constrains surface colors during the reverse inference process to avoid color shifts. DeS3 is able to differentiate shadow regions from the underlying objects, as well as from the objects casting the shadows. This capability enables DeS3 to better recover the structures of objects even when they are partially occluded by shadows. Unlike existing methods that impose constraints during the training phase, we incorporate the ViT similarity and color convergence losses during the sampling stage. This enables DeS3 to effectively combine its strong modeling capability with input-specific knowledge in a self-tuned manner. Our method outperforms state-of-the-art methods on the SRD, AISTD, LRSS, USR, and UIUC datasets, removing hard, soft, and self shadows robustly. In particular, it lowers the whole-image RMSE on the SRD dataset by 20% compared with the SOTA method.
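Since the guidance is applied at sampling time rather than training time, the core mechanism amounts to differentiating a ViT-feature similarity term and a color term through the predicted clean image at each reverse diffusion step. The following is a minimal, self-contained PyTorch sketch of that idea; the stand-in modules, loss forms, weights, and names (`vit_similarity_loss`, `color_convergence_loss`, `guided_step`) are illustrative assumptions, not the paper's actual architecture or API.

```python
# Minimal sketch: sampling-time guidance with a ViT-feature similarity term
# and a color convergence term, under a DDPM-style reverse process.
# TinyFeatureNet / TinyDenoiser stand in for the real pre-trained ViT and
# the trained noise predictor so the sketch runs as-is.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFeatureNet(nn.Module):
    """Stand-in for a frozen pre-trained ViT feature extractor."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=4, stride=4)
    def forward(self, x):
        return self.conv(x).flatten(2).transpose(1, 2)  # (B, tokens, dim)

class TinyDenoiser(nn.Module):
    """Stand-in for the diffusion noise predictor eps_theta(x_t, t)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)
    def forward(self, x_t, t):
        return self.conv(x_t)

def vit_similarity_loss(feats, x0_pred, shadow_img):
    # Cosine distance between token features of the current clean-image
    # estimate and the shadow input, encouraging shared scene structure.
    f_pred, f_in = feats(x0_pred), feats(shadow_img)
    return (1.0 - F.cosine_similarity(f_pred, f_in, dim=-1)).mean()

def color_convergence_loss(x0_pred, x0_prev):
    # Penalize drift in per-channel mean color between consecutive
    # clean-image estimates to suppress color shifts during sampling.
    return F.mse_loss(x0_pred.mean(dim=(2, 3)), x0_prev.mean(dim=(2, 3)))

def guided_step(x_t, t, denoiser, feats, shadow_img, x0_prev,
                alpha_bar, w_vit=1.0, w_col=0.1, step=0.05):
    """One reverse step: gradients of the guidance losses w.r.t. x_t
    steer the sample (weights and step size are placeholder values)."""
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t)
    # Standard DDPM identity: x0 = (x_t - sqrt(1 - a_bar) * eps) / sqrt(a_bar)
    x0_pred = (x_t - (1 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()
    loss = (w_vit * vit_similarity_loss(feats, x0_pred, shadow_img)
            + w_col * color_convergence_loss(x0_pred, x0_prev))
    grad = torch.autograd.grad(loss, x_t)[0]
    # Nudge x_t against the guidance gradient; a full sampler would then
    # take the usual DDPM posterior step from the corrected x_t.
    return (x_t - step * grad).detach(), x0_pred.detach()

if __name__ == "__main__":
    denoiser, feats = TinyDenoiser(), TinyFeatureNet().eval()
    for p in feats.parameters():
        p.requires_grad_(False)
    shadow = torch.rand(1, 3, 64, 64)
    x_t, x0_prev = torch.randn_like(shadow), shadow.clone()
    for t in reversed(range(5)):            # toy 5-step schedule
        a_bar = torch.tensor(0.1 + 0.18 * t)  # toy cumulative alpha
        x_t, x0_prev = guided_step(x_t, t, denoiser, feats,
                                   shadow, x0_prev, a_bar)
    print("final sample:", x_t.shape)
```

The design point this sketch illustrates is the one the abstract emphasizes: because the losses enter only at sampling, the guidance adapts to each input image without retraining the diffusion model.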