Temporally consistent depth estimation is crucial for real-time applications such as augmented reality. While stereo depth estimation has received substantial attention, leading to improvements on a frame-by-frame basis, relatively little work has focused on maintaining temporal consistency across frames. Indeed, based on our analysis, current stereo depth estimation techniques still suffer from poor temporal consistency. Stabilizing depth temporally in dynamic scenes is challenging due to concurrent object and camera motion. In an online setting, the problem is further aggravated because only past frames are available. In this paper, we present a technique to produce temporally consistent depth estimates in dynamic scenes in an online setting. Our network augments current per-frame stereo networks with novel motion and fusion networks. The motion network accounts for both object and camera motion by predicting a per-pixel SE3 transformation. The fusion network improves prediction consistency by aggregating the current and previous predictions with regressed weights. We conduct extensive experiments across varied datasets (synthetic, outdoor, indoor, and medical). In both zero-shot generalization and domain fine-tuning, we demonstrate that our proposed approach outperforms competing methods in terms of temporal stability and per-frame accuracy, both quantitatively and qualitatively. Our code will be available online.
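To make the fusion step concrete, below is a minimal, hypothetical sketch of how a fusion network could combine the current per-frame prediction with the previous prediction using a regressed per-pixel weight. The tensor names (`depth_t`, `depth_warp`), layer sizes, and the module `FusionSketch` are illustrative assumptions, not the paper's actual implementation; the warping of the previous prediction by the motion network's per-pixel SE3 transforms is assumed to have already happened.

```python
# Hypothetical sketch of weighted fusion of current and warped-previous depth.
# Assumes: depth_t   = current per-frame stereo prediction, (B, 1, H, W)
#          depth_warp = previous prediction warped into the current frame, (B, 1, H, W)
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    def __init__(self, in_ch: int = 2, hidden: int = 32):
        super().__init__()
        # Regress a per-pixel fusion weight w in [0, 1] from the two depth maps.
        self.weight_net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, depth_t: torch.Tensor, depth_warp: torch.Tensor) -> torch.Tensor:
        w = self.weight_net(torch.cat([depth_t, depth_warp], dim=1))
        # Convex combination: w favors the current prediction, (1 - w) the history.
        return w * depth_t + (1.0 - w) * depth_warp

if __name__ == "__main__":
    fuse = FusionSketch()
    depth_t = torch.rand(1, 1, 64, 64)     # current frame prediction (illustrative)
    depth_warp = torch.rand(1, 1, 64, 64)  # warped previous prediction (illustrative)
    print(fuse(depth_t, depth_warp).shape)  # torch.Size([1, 1, 64, 64])
```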