This paper proposes a self-supervised monocular image-to-depth prediction framework trained with an end-to-end photometric loss that handles not only 6-DOF camera motion but also the 6-DOF motion of individual object instances. Self-supervision is performed by warping images across a video sequence using the predicted depth and scene motion, including the motion of object instances. One novelty of the proposed method is the use of multi-head attention in a transformer network to match moving objects across time and to model their interaction and dynamics, which enables accurate and robust pose estimation for each object instance. Most image-to-depth prediction frameworks assume a rigid scene, which largely degrades their performance on dynamic objects, and only a few SOTA methods account for dynamic objects. The proposed method is shown to outperform these methods on standard benchmarks, and the impact of dynamic motion on these benchmarks is exposed. Furthermore, the proposed image-to-depth prediction framework is also shown to be competitive with SOTA video-to-depth prediction frameworks.
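To make the self-supervision signal concrete, the sketch below illustrates a photometric warping loss in which static pixels follow the 6-DOF camera motion while pixels belonging to a moving instance additionally follow that instance's 6-DOF motion. This is a minimal PyTorch sketch, not the paper's implementation: all tensor shapes, the names (`K`, `T_cam`, `T_objs`, `obj_masks`), the composition order of the transforms, and the plain L1 photometric term are illustrative assumptions.

```python
# Minimal sketch (assumed interfaces, not the paper's code) of a photometric
# warping loss with 6-DOF camera motion plus per-instance 6-DOF object motion.
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift target pixels to 3D camera coordinates. depth: (B,1,H,W)."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                            torch.arange(W, dtype=depth.dtype), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    rays = K_inv @ pix                                # (B,3,HW)
    return rays * depth.reshape(B, 1, -1)             # scale rays by depth

def project(points, K):
    """Project 3D points to pixel coordinates. points: (B,3,HW)."""
    uvw = K @ points
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)   # (B,2,HW)

def photometric_warp_loss(tgt, src, depth, K, T_cam, obj_masks, T_objs):
    """tgt/src: (B,3,H,W) images, depth: (B,1,H,W) for tgt, K: (B,3,3),
    T_cam: (B,4,4) target->source camera motion,
    obj_masks: (B,M,H,W) instance masks in tgt, T_objs: (B,M,4,4) per-instance motion."""
    B, _, H, W = tgt.shape
    pts = backproject(depth, torch.inverse(K))                         # (B,3,HW)
    pts_h = torch.cat([pts, torch.ones(B, 1, H * W, dtype=pts.dtype)], dim=1)

    # Static background moves with the camera only.
    warped = (T_cam @ pts_h)[:, :3]
    # Pixels on each moving instance additionally follow that instance's motion.
    for m in range(obj_masks.shape[1]):
        mask = obj_masks[:, m].reshape(B, 1, -1)
        obj_pts = (T_cam @ T_objs[:, m] @ pts_h)[:, :3]
        warped = mask * obj_pts + (1 - mask) * warped

    # Project, normalize to [-1, 1], and sample the source image.
    uv = project(warped, K).reshape(B, 2, H, W).permute(0, 2, 3, 1)
    grid = torch.stack([2 * uv[..., 0] / (W - 1) - 1,
                        2 * uv[..., 1] / (H - 1) - 1], dim=-1)
    src_warped = F.grid_sample(src, grid, padding_mode="border",
                               align_corners=True)
    return (tgt - src_warped).abs().mean()            # L1 photometric loss
```

In this sketch, replacing the per-instance transforms `T_objs` with the identity recovers the standard rigid-scene photometric loss, which is the assumption the paper argues degrades performance on dynamic objects.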