This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often producing unstructured features that lack 3D-awareness. We argue that 3D-awareness is vital for modeling the cross-task correlations essential to comprehensive scene understanding. We propose to address this problem by integrating correlations across views, i.e., a cost volume, as a geometric consistency constraint in the MTL network. Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, that exchanges information across views and captures cross-view correlations; its output is integrated with features from the MTL encoder for multi-task prediction. The module is architecture-agnostic and can be applied to both single-view and multi-view data. Extensive experiments on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods and improves their performance.
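As a rough illustration of the mechanism the abstract describes, the sketch below shows one way a shared cross-view cost volume can be fused with task-shared encoder features. It is a minimal sketch under stated assumptions, not the paper's exact design: the class name `CrossViewModule`, the fixed number of depth hypotheses, and the concatenation-based fusion are all illustrative choices, and the source-view features are assumed to be pre-warped to the reference view by an external plane-sweep step.

```python
# Minimal sketch of a cost-volume-based cross-view module in PyTorch.
# Assumptions (not from the paper): source features are already warped
# to the reference view under D depth hypotheses, and fusion is done by
# channel concatenation followed by a 1x1 convolution.
import torch
import torch.nn as nn


class CrossViewModule(nn.Module):
    def __init__(self, feat_dim: int, num_hypotheses: int = 32):
        super().__init__()
        # Reduce the (B, D, H, W) correlation volume to feat_dim channels.
        self.reduce = nn.Conv2d(num_hypotheses, feat_dim, kernel_size=3, padding=1)
        # Fuse the geometric cue with the task-shared encoder feature.
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1)

    def forward(self, ref_feat, src_feats_warped):
        # ref_feat:         (B, C, H, W) reference-view encoder features
        # src_feats_warped: (B, D, C, H, W) source features warped to the
        #                   reference view under D depth hypotheses
        B, D, C, H, W = src_feats_warped.shape
        # Dot-product correlation yields a (B, D, H, W) cost volume.
        cost = (ref_feat.unsqueeze(1) * src_feats_warped).sum(dim=2) / C ** 0.5
        geo = self.reduce(cost)  # (B, C, H, W) geometry-aware cue
        # Concatenate and fuse so every task head sees the same
        # geometry-injected representation.
        return self.fuse(torch.cat([ref_feat, geo], dim=1))
```

Because the module only consumes a reference feature map and a stack of warped source features, it is agnostic to the backbone producing them; in a single-view setting the source views would have to come from elsewhere (e.g., synthesized neighbors), which is one possible reading of the module's applicability to single-view data.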