Monocular depth inference has received tremendous attention from researchers in recent years and remains a promising replacement for expensive time-of-flight sensors, but issues with scale acquisition and implementation overhead still plague these systems. To this end, this work presents an unsupervised learning framework that predicts metrically scaled depth maps and egomotion, in addition to camera intrinsics, from a sequence of monocular images via a single network. Our method incorporates both spatial and temporal geometric constraints to resolve the depth and pose scale factors, enforcing them within the supervisory reconstruction loss functions at training time. Only unlabeled stereo sequences are required to train the weights of our single-network architecture, which reduces overall implementation overhead compared with previous methods. Our results demonstrate strong performance against the current state of the art on multiple sequences of the KITTI driving dataset, and the reduced network complexity allows for faster training times.
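To make the notion of a supervisory reconstruction loss concrete, the sketch below (not the authors' released code) shows the kind of view-synthesis objective such frameworks typically use: a source frame is inverse-warped into the target view using the predicted depth, relative pose, and camera intrinsics, and the warped image is compared photometrically to the target. All tensor shapes, function names, and the plain L1 penalty are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of a photometric reconstruction loss, assuming a
# PyTorch implementation; shapes and names are hypothetical.
import torch
import torch.nn.functional as F


def warp_source_to_target(src_img, tgt_depth, T_tgt_to_src, K):
    """Inverse-warp src_img into the target view.

    src_img:      (B, 3, H, W) source frame (temporal neighbour or stereo pair)
    tgt_depth:    (B, 1, H, W) predicted depth of the target frame
    T_tgt_to_src: (B, 4, 4)    predicted relative pose (target -> source)
    K:            (B, 3, 3)    predicted camera intrinsics
    """
    B, _, H, W = src_img.shape
    device = src_img.device

    # Pixel grid in homogeneous coordinates, shape (B, 3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).reshape(1, 3, -1).expand(B, -1, -1)

    # Back-project pixels to 3-D points in the target camera, then transform
    # them into the source camera frame with the predicted relative pose.
    cam_pts = torch.linalg.inv(K) @ pix * tgt_depth.reshape(B, 1, -1)
    cam_pts_h = torch.cat([cam_pts, torch.ones(B, 1, H * W, device=device)], dim=1)
    src_pts = (T_tgt_to_src @ cam_pts_h)[:, :3]

    # Project into the source image and normalise to [-1, 1] for grid_sample.
    proj = K @ src_pts
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)


def reconstruction_loss(tgt_img, warped_imgs):
    """Mean L1 photometric error between the target frame and each warped view."""
    return sum(torch.mean(torch.abs(tgt_img - w)) for w in warped_imgs) / len(warped_imgs)
```

In a setup like the one the abstract describes, the same warp would be applied both to a temporally adjacent frame (using the predicted egomotion) and to the stereo counterpart (using the fixed stereo baseline), so the spatial and temporal photometric terms jointly constrain the absolute scale of depth and pose.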