Learning scene flow from a monocular camera remains challenging due to its ill-posedness and the scarcity of annotated data. Self-supervised methods can learn scene flow estimation from unlabeled data, yet their accuracy lags behind that of (semi-)supervised methods. In this paper, we introduce a self-supervised monocular scene flow method that substantially improves accuracy over previous approaches. Building on RAFT, a state-of-the-art optical flow model, we design a new decoder that iteratively updates 3D motion fields and disparity maps simultaneously. Furthermore, we propose an enhanced upsampling layer and a disparity initialization technique, which together further improve accuracy by up to 7.2%. Our method achieves state-of-the-art accuracy among all self-supervised monocular scene flow methods, improving over the previous best by 34.2%. Our fine-tuned model outperforms the best previous semi-supervised method while running 228 times faster. Code will be made publicly available.
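The abstract's core idea, a RAFT-style decoder that jointly refines a 3D motion field and a disparity map over several iterations, can be sketched in simplified form. Everything below is an illustrative assumption, not the paper's actual architecture: the GRU-style update is replaced by a single pointwise layer with random weights, and the feature resolution, channel counts, and disparity initialization value are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, HID = 8, 8, 16  # tiny feature map, purely for illustration

def pointwise(x, w, b):
    """1x1 'convolution': x is (H, W, Cin), w is (Cin, Cout)."""
    return x @ w + b

# Hypothetical learned weights, randomly initialized here for the sketch.
W_h = rng.normal(0, 0.1, (HID + 4, HID))  # hidden-state update (simplified GRU)
b_h = np.zeros(HID)
W_m = rng.normal(0, 0.1, (HID, 3))        # head predicting a 3D motion residual
b_m = np.zeros(3)
W_d = rng.normal(0, 0.1, (HID, 1))        # head predicting a disparity residual
b_d = np.zeros(1)

def iterative_decode(hidden, motion, disp, n_iters=8):
    """Jointly refine a 3D motion field (H, W, 3) and disparity map (H, W, 1).

    Each iteration feeds the current estimates back into the update block
    and applies residual corrections, mirroring RAFT's iterative scheme.
    """
    for _ in range(n_iters):
        inp = np.concatenate([hidden, motion, disp], axis=-1)
        hidden = np.tanh(pointwise(inp, W_h, b_h))     # update hidden state
        motion = motion + pointwise(hidden, W_m, b_m)  # residual motion update
        # residual disparity update; disparity is kept non-negative
        disp = np.maximum(disp + pointwise(hidden, W_d, b_d), 0.0)
    return motion, disp

hidden0 = rng.normal(size=(H, W, HID))
motion0 = np.zeros((H, W, 3))          # motion starts at zero
disp0 = np.full((H, W, 1), 0.5)       # hypothetical disparity initialization
motion, disp = iterative_decode(hidden0, motion0, disp0)
print(motion.shape, disp.shape)
```

In the real model the updates would operate on correlation features at coarse resolution, followed by a learned upsampling layer; this sketch only conveys the joint, iterative residual-update structure.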