Self-supervised learning of depth map prediction and motion estimation from monocular video sequences is of vital importance, since it enables a broad range of tasks in robotics and autonomous vehicles. A large number of research efforts have improved performance by tackling illumination variation, occlusions, and dynamic objects, to name a few. However, each of those efforts targets an individual goal, and they remain separate works. Moreover, most previous works have adopted the same CNN architecture and thus have not reaped architectural benefits. Therefore, there remains a need to investigate the inter-dependency of the previous methods and the effect of architectural factors. To achieve these objectives, we revisit numerous previously proposed self-supervised methods for joint learning of depth and motion, perform a comprehensive empirical study, and unveil multiple crucial insights. Furthermore, as a result of our study, we substantially improve performance, outperforming the previous state of the art.