In this paper we propose USegScene, a framework for semantically guided unsupervised learning of depth, optical flow, and ego-motion estimation from stereo camera images using convolutional neural networks. Our framework leverages semantic information for improved regularization of depth and optical flow maps, for multimodal fusion, and for occlusion filling, treating the motions of dynamic rigid objects as independent SE(3) transformations. Furthermore, complementary to pure photometric matching, we propose matching of semantic features, pixel-wise classes, and object instance borders between consecutive images. In contrast to previous methods, we propose a network architecture that jointly predicts all outputs using shared encoders and allows information to pass across task domains; for example, the prediction of optical flow can benefit from the prediction of depth. Furthermore, we explicitly learn the depth and optical flow occlusion maps inside the network, which are leveraged to improve the predictions in the respective regions. We present results on the popular KITTI dataset and show that our approach outperforms other methods by a large margin.