We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks, including depth estimation, semantic segmentation, reshading, surface normal estimation, 2D keypoint detection, and edge detection. Based on the Swin transformer model, our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads. At the heart of our approach is a shared attention mechanism modeling the dependencies across the tasks. We evaluate our model on several multitask benchmarks, showing that our MulT framework outperforms both the state-of-the-art multitask convolutional neural network models and all the respective single-task transformer models. Our experiments further highlight the benefits of sharing attention across all the tasks, and demonstrate that our MulT model is robust and generalizes well to new domains. Our project website is at https://ivrl.github.io/MulT/.
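To make the overall structure concrete, below is a minimal PyTorch-style sketch of a MulT-like setup, not the authors' implementation: a shared encoder, a single attention module whose weights are shared across tasks, and task-specific decoder heads. All class names (`SharedEncoder`, `SharedAttention`, `TaskHead`, `MultitaskModel`), dimensions, and the plain convolutions standing in for Swin blocks are illustrative assumptions.

```python
# Hypothetical sketch of a shared-encoder, shared-attention multitask model.
import torch
import torch.nn as nn


class SharedEncoder(nn.Module):
    """Stand-in for the Swin backbone: patch embedding plus a few conv blocks."""
    def __init__(self, dim=96):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, kernel_size=4, stride=4)  # patch embedding
        self.blocks = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
        )

    def forward(self, x):
        return self.blocks(self.stem(x))  # (B, dim, H/4, W/4)


class SharedAttention(nn.Module):
    """One self-attention module whose weights are reused by every task."""
    def __init__(self, dim=96, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat):
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)      # (B, HW, C)
        out, _ = self.attn(tokens, tokens, tokens)    # attention over patch tokens
        return out.transpose(1, 2).reshape(b, c, h, w)


class TaskHead(nn.Module):
    """Task-specific decoder head: maps shared features to a per-task output."""
    def __init__(self, dim=96, out_channels=1):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(dim, out_channels, 1),
        )

    def forward(self, feat):
        return self.decode(feat)


class MultitaskModel(nn.Module):
    def __init__(self, task_channels):
        super().__init__()
        self.encoder = SharedEncoder()
        self.shared_attn = SharedAttention()
        self.heads = nn.ModuleDict(
            {name: TaskHead(out_channels=c) for name, c in task_channels.items()}
        )

    def forward(self, x):
        feat = self.shared_attn(self.encoder(x))      # shared representation
        return {name: head(feat) for name, head in self.heads.items()}


if __name__ == "__main__":
    # Example task set; output channel counts are placeholders.
    model = MultitaskModel({"depth": 1, "segmentation": 21, "normals": 3})
    preds = model(torch.randn(2, 3, 224, 224))
    print({k: tuple(v.shape) for k, v in preds.items()})
```

The key design point mirrored here is that every task head reads the same encoded features processed by the same attention weights, so cross-task dependencies are captured in one place rather than in per-task attention modules.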