Estimating the motion of the camera together with the 3D structure of the scene from a monocular vision system is a complex task that often relies on the so-called scene rigidity assumption. When observing a dynamic environment, this assumption is violated, which leads to an ambiguity between the ego-motion of the camera and the motion of the objects. To solve this problem, we present a self-supervised learning framework for 3D object motion field estimation from monocular videos. Our contributions are two-fold. First, we propose a two-stage projection pipeline that explicitly disentangles the camera ego-motion and the object motions via a dynamics attention module (DAM). Specifically, we design an integrated motion model that estimates the motion of the camera and the objects in the first and second warping stages, respectively, controlled by the attention module through a shared motion encoder. Second, we propose an object motion field estimation method based on contrastive sample consensus (CSAC), which takes advantage of a weak semantic prior (bounding boxes from an object detector) and geometric constraints (each object follows a rigid-body motion model). Experiments on KITTI, Cityscapes, and the Waymo Open Dataset demonstrate the relevance of our approach and show that our method outperforms state-of-the-art algorithms on the tasks of self-supervised monocular depth estimation, object motion segmentation, monocular scene flow estimation, and visual odometry.