Self-supervised monocular depth estimation approaches either ignore independently moving objects in the scene or require a separate segmentation step to identify them. We propose MonoDepthSeg to jointly estimate depth and segment moving objects from monocular video without using any ground-truth labels. We decompose the scene into a fixed number of components, where each component corresponds to an image region with its own transformation matrix representing its motion. We estimate both the mask and the motion of each component efficiently with a shared encoder. We evaluate our method on three driving datasets and show that our model clearly improves depth estimation while decomposing the scene into separately moving components.
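The core idea of per-component rigid motion can be sketched as follows: given soft masks assigning pixels to a fixed number of components and one SE(3) transformation matrix per component, the motion of each 3D point is the mask-weighted blend of the component transforms. This is a minimal NumPy sketch under assumed shapes; the function name `compose_scene_motion` and the toy inputs are illustrative, not the paper's actual implementation.

```python
import numpy as np

def compose_scene_motion(points, masks, transforms):
    """Warp 3D points with per-component rigid transforms blended by soft masks.

    points:     (N, 3) back-projected 3D points from predicted depth
    masks:      (K, N) soft assignment of each point to K components
                (each column sums to 1)
    transforms: (K, 4, 4) one SE(3) motion matrix per component
    """
    n = points.shape[0]
    homog = np.concatenate([points, np.ones((n, 1))], axis=1)   # (N, 4)
    # Apply every component's transform to every point: (K, N, 4)
    moved = np.einsum('kij,nj->kni', transforms, homog)
    # Blend per point using the mask weights: (N, 4) -> keep xyz
    return np.einsum('kn,kni->ni', masks, moved)[:, :3]

# Toy example: 2 components, 3 points at the origin.
# Component 0 is static; component 1 translates by +1 along x.
points = np.zeros((3, 3))
masks = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5]])
T_static = np.eye(4)
T_move = np.eye(4)
T_move[0, 3] = 1.0
out = compose_scene_motion(points, masks, np.stack([T_static, T_move]))
# Point 0 stays put, point 1 moves by 1, point 2 moves by 0.5.
```

A soft (rather than hard) mask keeps the blend differentiable, which is what allows the masks and motions to be learned jointly from the photometric reconstruction loss without segmentation labels.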