Temporal consistency is the key challenge of video depth estimation. Previous works rely on additional optical flow or camera poses, which are time-consuming to obtain. By contrast, we derive consistency with less information. Since videos inherently contain heavy temporal redundancy, a missing frame can be recovered from its neighboring ones. Inspired by this, we propose the frame masking network (FMNet), a spatial-temporal transformer that predicts the depth of masked frames from their neighboring frames. By reconstructing masked temporal features, the FMNet learns intrinsic inter-frame correlations, which leads to consistency. Experimental results demonstrate that, compared with prior art, our approach achieves comparable spatial accuracy and higher temporal consistency without any additional information. Our work provides a new perspective on consistent video depth estimation. Our official project page is https://github.com/RaymondWang987/FMNet.
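To make the masked-frame modeling idea concrete, below is a minimal PyTorch sketch of the core mechanism: a learnable mask token replaces one frame's features, and a temporal transformer reconstructs that frame from its neighbors before a depth head decodes it. All names, dimensions, and the one-token-per-frame simplification here are our own assumptions for illustration, not the official FMNet implementation.

```python
import torch
import torch.nn as nn

class MaskedFrameModel(nn.Module):
    """Toy illustration of masked-frame modeling (hypothetical simplification,
    not the official FMNet): a temporal transformer reconstructs the features
    of a masked frame from its neighboring frames."""

    def __init__(self, dim=256, num_layers=4, num_heads=8):
        super().__init__()
        # Learnable token that stands in for the masked frame's features.
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Placeholder depth regressor; a real model would decode a depth map.
        self.depth_head = nn.Linear(dim, 1)

    def forward(self, frame_feats, mask_idx):
        # frame_feats: (B, T, C) per-frame feature tokens from a spatial encoder.
        x = frame_feats.clone()
        x[:, mask_idx] = self.mask_token   # mask out the selected frame
        x = self.temporal_encoder(x)       # aggregate info from neighboring frames
        return self.depth_head(x[:, mask_idx])  # predict depth for the masked frame

# Usage: mask the middle frame of a 5-frame clip and reconstruct it.
feats = torch.randn(2, 5, 256)             # (batch, frames, channels)
model = MaskedFrameModel()
pred = model(feats, mask_idx=2)            # -> shape (2, 1)
```

Because the reconstruction target must agree with the unmasked neighbors, the transformer is pushed to exploit the temporal redundancy across frames, which is the source of the consistency the abstract describes.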