Temporally consistent depth estimation is crucial for online applications such as augmented reality. While stereo depth estimation has received substantial attention as a promising way to generate 3D information, relatively little work has focused on maintaining temporal stability. Indeed, based on our analysis, current techniques still suffer from poor temporal consistency. Stabilizing depth temporally in dynamic scenes is challenging due to concurrent object and camera motion. In an online setting, the problem is further aggravated because only past frames are available. We present a framework named Consistent Online Dynamic Depth (CODD) to produce temporally consistent depth estimates in dynamic scenes in an online setting. CODD augments per-frame stereo networks with novel motion and fusion networks. The motion network accounts for dynamics by predicting a per-pixel SE3 transformation and aligning the observations. The fusion network improves temporal depth consistency by aggregating the current and past estimates. We conduct extensive experiments and demonstrate quantitatively and qualitatively that CODD outperforms competing methods in terms of temporal consistency and performs on par with them in terms of per-frame accuracy.
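To make the pipeline summarized above concrete, the following is a minimal sketch of an online loop that augments a per-frame stereo network with a motion network and a fusion network. The module names (StereoNet, MotionNet, FusionNet), the stand-in layers, and the simplified warp_with_se3 alignment step are assumptions for illustration only and do not reproduce the authors' implementation.

```python
# Minimal sketch of an online stereo pipeline with motion-based alignment and
# temporal fusion, assuming hypothetical module names -- not CODD's actual code.
import torch
import torch.nn as nn

class StereoNet(nn.Module):
    """Per-frame stereo network: left/right images -> depth (stand-in layer)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 1, kernel_size=3, padding=1)
    def forward(self, left, right):
        return self.net(torch.cat([left, right], dim=1))

class MotionNet(nn.Module):
    """Predicts a per-pixel 6-DoF motion field (SE3 twist) that aligns the
    previous depth estimate to the current frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(2, 6, kernel_size=3, padding=1)
    def forward(self, depth_prev, depth_curr):
        return self.net(torch.cat([depth_prev, depth_curr], dim=1))

class FusionNet(nn.Module):
    """Fuses the aligned past estimate with the current estimate via a
    predicted per-pixel blending weight."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(2, 1, 3, padding=1), nn.Sigmoid())
    def forward(self, depth_aligned, depth_curr):
        w = self.net(torch.cat([depth_aligned, depth_curr], dim=1))
        return w * depth_aligned + (1.0 - w) * depth_curr

def warp_with_se3(depth_prev, twist):
    """Placeholder alignment step: a full implementation would unproject the
    previous depth, apply the per-pixel SE3 transform, and reproject.
    Here only the translational z-component is applied as a depth offset."""
    return depth_prev + twist[:, 2:3]

stereo, motion, fusion = StereoNet(), MotionNet(), FusionNet()
depth_prev = None
for t in range(3):  # online loop over a (dummy) stereo stream
    left = torch.rand(1, 3, 64, 64)
    right = torch.rand(1, 3, 64, 64)
    depth_curr = stereo(left, right)                      # per-frame estimate
    if depth_prev is not None:
        twist = motion(depth_prev, depth_curr)            # per-pixel motion
        depth_aligned = warp_with_se3(depth_prev, twist)  # align past to present
        depth_curr = fusion(depth_aligned, depth_curr)    # temporally fused depth
    depth_prev = depth_curr.detach()  # only past frames are retained (online)
```

The sketch mirrors the online constraint in the abstract: only the previous estimate is carried forward, the motion step compensates for object and camera motion before fusion, and the fusion step trades off the aligned past estimate against the current one per pixel.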