Vision Transformers have shown strong performance on individual tasks such as classification and segmentation. However, real-world problems are rarely isolated, calling for vision transformers that can perform multiple tasks concurrently. Existing multi-task vision transformers are handcrafted and rely heavily on human expertise. In this work, we propose a novel one-shot neural architecture search framework, dubbed AutoTaskFormer (Automated Multi-Task Vision TransFormer), to automate this process. AutoTaskFormer not only automatically identifies which weights to share across multiple tasks, but also provides thousands of well-trained vision transformers spanning a wide range of architectural parameters (e.g., number of heads and network depth) for deployment under various resource constraints. Experiments on both small-scale (2-task Cityscapes and 3-task NYUv2) and large-scale (16-task Taskonomy) datasets show that AutoTaskFormer outperforms state-of-the-art handcrafted vision transformers in multi-task learning. The code and models will be open-sourced.
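To make the one-shot idea concrete, the following is a minimal sketch (not the authors' implementation) of a multi-task transformer supernet: a single pool of transformer blocks is trained once, and each sampled sub-network assigns every task a subset of blocks, so blocks picked by several tasks are shared while the rest stay task-specific. The class and method names (`MultiTaskSupernet`, `sample_subnet`) and all hyperparameters are hypothetical placeholders, not taken from the paper.

```python
# Hypothetical sketch of a one-shot multi-task supernet; not AutoTaskFormer's actual code.
import random
import torch
import torch.nn as nn


class MultiTaskSupernet(nn.Module):
    def __init__(self, dim=192, max_depth=8, num_heads=3, tasks=("seg", "depth")):
        super().__init__()
        # One shared pool of transformer blocks (the supernet backbone).
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
            for _ in range(max_depth)
        )
        # One lightweight prediction head per task.
        self.heads = nn.ModuleDict({t: nn.Linear(dim, dim) for t in tasks})

    def sample_subnet(self):
        """Sample, per task, which backbone blocks it uses (elastic depth + cross-task sharing)."""
        return {
            task: sorted(random.sample(range(len(self.blocks)),
                                       k=random.randint(2, len(self.blocks))))
            for task in self.heads
        }

    def forward(self, x, subnet):
        # Blocks selected by multiple tasks are shared; the others remain task-specific.
        outputs = {}
        for task, block_ids in subnet.items():
            h = x
            for i in block_ids:
                h = self.blocks[i](h)
            outputs[task] = self.heads[task](h)
        return outputs


model = MultiTaskSupernet()
tokens = torch.randn(2, 16, 192)              # (batch, tokens, embedding dim)
preds = model(tokens, model.sample_subnet())  # one forward pass serves all tasks
print({t: o.shape for t, o in preds.items()})
```

After supernet training, sub-networks of different depths and sharing patterns could be drawn from the same weights and evaluated without retraining, which is what enables serving many resource budgets from a single search.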