While single-image shadow detection has been improving rapidly in recent years, video shadow detection remains a challenging task due to data scarcity and the difficulty of modelling temporal consistency. The current video shadow detection method addresses temporal consistency via a co-attention mechanism, which mostly exploits temporally coherent information but is not robust in detecting moving shadows and small shadow regions. In this paper, we propose a simple but powerful method to better aggregate information temporally. We use an optical-flow-based warping module to align and then combine features across frames. We apply this warping module across multiple deep-network layers to retrieve information from neighboring frames, including both local details and high-level semantic information. We train and test our framework on the ViSha dataset. Experimental results show that our model outperforms the state-of-the-art video shadow detection method by 28%, reducing BER from 16.7 to 12.0.
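To make the feature-alignment idea concrete, below is a minimal PyTorch sketch of how flow-based warping and fusion between frames might look. It assumes a flow field already estimated by an off-the-shelf optical-flow network; the function names (`warp_features`, `fuse`) and the simple averaging fusion are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed names, not the authors' code) of optical-flow-based
# feature warping: features from a neighboring frame are aligned to the current
# frame using a flow field, then fused with the current-frame features.
import torch
import torch.nn.functional as F


def warp_features(feat_neighbor: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp neighbor-frame features (B, C, H, W) to the current frame using
    a flow field (B, 2, H, W) given in pixel displacements."""
    b, _, h, w = feat_neighbor.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=flow.device, dtype=flow.dtype),
        torch.arange(w, device=flow.device, dtype=flow.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # displaced x coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]  # displaced y coordinates
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid_x / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid_y / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat_neighbor, grid, align_corners=True)


def fuse(feat_current: torch.Tensor, feat_neighbor: torch.Tensor,
         flow: torch.Tensor) -> torch.Tensor:
    """Combine current-frame features with flow-aligned neighbor features.
    Averaging is a placeholder; a learned fusion would typically be used."""
    warped = warp_features(feat_neighbor, flow)
    return 0.5 * (feat_current + warped)
```

In the described framework, this kind of warping-and-fusion step would be applied at several feature levels of the backbone, so that both fine local details and high-level semantics from neighboring frames contribute to the current frame's shadow prediction.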