Unsupervised methods have shown promising results on monocular depth estimation. However, the training data must be captured in scenes without moving objects. Moreover, to push the envelope of accuracy, recent methods tend to increase their number of model parameters. In this paper, an unsupervised learning framework is proposed to jointly predict monocular depth and complete 3D motion, including the motions of moving objects and the camera. (1) Recurrent modulation units are used to adaptively and iteratively fuse encoder and decoder features. This improves single-image depth inference without inflating the parameter count. (2) Instead of using a single set of filters for upsampling, multiple sets of filters are devised for residual upsampling. This facilitates the learning of edge-preserving filters and leads to improved performance. (3) A warping-based network is used to estimate a motion field of moving objects without using semantic priors. This removes the requirement of scene rigidity and allows general videos to be used for unsupervised learning. The motion field is further regularized by an outlier-aware training loss. Although the depth model uses only a single image at test time and has only 2.97M parameters, it achieves state-of-the-art results on the KITTI and Cityscapes benchmarks.
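To make the idea of "complete 3D motion" in item (3) more concrete, the following is a minimal sketch of how a pixel could be reprojected from one frame to another using predicted depth, camera motion, and a per-pixel object motion vector. The abstract does not give the formulation, so everything here is an assumption: the pinhole camera model, the function name `reproject_pixel`, its parameters, and the choice to add the object motion in the target camera frame are all hypothetical and follow a formulation common in unsupervised depth estimation rather than the authors' method.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def reproject_pixel(
    u: float,
    v: float,
    depth: float,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    rotation: Sequence[Sequence[float]],
    translation: Vec3,
    object_motion: Vec3 = (0.0, 0.0, 0.0),
) -> Tuple[float, float]:
    """Map a source pixel to its location in the target frame.

    Hypothetical helper (not from the paper): assumes a pinhole camera with
    intrinsics (fx, fy, cx, cy), a 3x3 camera rotation, a camera translation,
    and a 3D object motion vector added after the camera transform.
    """
    # Back-project the pixel to a 3D point in the source camera frame.
    point = (
        (u - cx) / fx * depth,
        (v - cy) / fy * depth,
        depth,
    )

    # Apply the rigid camera motion, then add the object's own 3D motion.
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3))
        + translation[i]
        + object_motion[i]
        for i in range(3)
    ]

    if moved[2] <= 0.0:
        raise ValueError("point lies behind the target camera")

    # Project the moved 3D point back onto the target image plane.
    return (fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Static camera, object moving 1 unit along the camera x-axis.
    print(reproject_pixel(100.0, 50.0, 10.0, 500.0, 500.0, 320.0, 240.0,
                          identity, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)))
```

With zero object motion the sketch reduces to the rigid-scene reprojection that purely static training data permits; a nonzero `object_motion` is what would let a view-synthesis loss account for independently moving objects under these assumptions.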