It is challenging to annotate large-scale datasets for supervised video shadow detection methods. Directly applying a model trained on labeled images to video frames may lead to high generalization error and temporally inconsistent results. In this paper, we address these challenges by proposing a Spatio-Temporal Interpolation Consistency Training (STICT) framework that rationally feeds unlabeled video frames, together with labeled images, into the training of an image shadow detection network. Specifically, we propose Spatial and Temporal ICT, in which we define two new interpolation schemes, \textit{i.e.}, spatial interpolation and temporal interpolation. We then derive the corresponding spatial and temporal interpolation consistency constraints, which enhance generalization in the pixel-wise classification task and encourage temporally consistent predictions, respectively. In addition, we design a Scale-Aware Network for multi-scale shadow knowledge learning in images, and propose a scale-consistency constraint to minimize the discrepancy among the predictions at different scales. Our approach is extensively validated on the ViSha dataset and a self-annotated dataset. Experimental results show that, even without video labels, our approach outperforms most state-of-the-art supervised, semi-supervised, and unsupervised image/video shadow detection methods, as well as methods from related tasks. Code and dataset are available at \url{https://github.com/yihong-97/STICT}.
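To make the interpolation consistency idea concrete, a generic interpolation consistency constraint in the style of standard ICT can be sketched as below; this is only an illustration of the underlying principle, not the exact spatial or temporal formulation of our framework, and the symbols $f_\theta$ (student network), $f_{\theta'}$ (exponentially averaged teacher weights), $u_i, u_j$ (unlabeled inputs), and $\alpha$ (Beta-distribution parameter) are assumed here for exposition:
\begin{equation*}
\mathrm{Mix}_\lambda(a, b) = \lambda a + (1 - \lambda)\, b, \qquad \lambda \sim \mathrm{Beta}(\alpha, \alpha),
\end{equation*}
\begin{equation*}
\mathcal{L}_{\mathrm{ICT}} = \mathbb{E}_{u_i, u_j}\left\| f_\theta\!\left(\mathrm{Mix}_\lambda(u_i, u_j)\right) - \mathrm{Mix}_\lambda\!\left(f_{\theta'}(u_i), f_{\theta'}(u_j)\right) \right\|^2 .
\end{equation*}
The constraint encourages the model's prediction at an interpolated input to match the interpolation of the predictions at the original inputs; our spatial and temporal variants specialize the interpolation scheme to pixel-wise classification and to consecutive video frames, respectively.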