Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly. However, when deploying MTL onto real-world systems that are often resource-constrained or latency-sensitive, two prominent challenges arise: (i) during training, simultaneously optimizing all tasks is often difficult due to gradient conflicts across tasks; (ii) at inference, current MTL regimes have to activate nearly the entire model even to execute just a single task. Yet most real systems demand only one or two tasks at each moment and switch between tasks as needed; such all-task-activated inference is therefore highly inefficient and non-scalable. In this paper, we present a model-accelerator co-design framework to enable efficient on-device MTL. Our framework, dubbed M$^3$ViT, customizes mixture-of-experts (MoE) layers into a vision transformer (ViT) backbone for MTL, and sparsely activates task-specific experts during training. At inference with any task of interest, the same design allows only the task-corresponding sparse expert pathway to be activated, instead of the full model. Our model design is further enhanced by hardware-level innovations, in particular a novel computation-reordering scheme tailored for memory-constrained MTL that achieves zero-overhead switching between tasks and scales to any number of experts. When executing single-task inference, M$^3$ViT achieves higher accuracies than encoder-focused MTL methods while reducing inference FLOPs by 88%. When implemented on a hardware platform of one Xilinx ZCU104 FPGA, our co-design framework reduces the memory requirement by 2.4 times, while achieving energy efficiency up to 9.23 times higher than a comparable FPGA baseline. Code is available at: https://github.com/VITA-Group/M3ViT.
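The core idea of task-dependent sparse activation can be illustrated with a minimal sketch: a mixture-of-experts layer whose gating function is conditioned on a task embedding, so that for any given task only the top-k expert pathways are computed. This is an illustrative NumPy toy, not the paper's actual M$^3$ViT implementation; the class name, gating form, and all parameters here are hypothetical.

```python
import numpy as np

class TaskConditionedMoE:
    """Toy MoE layer with a task-conditioned gate: each task activates
    only its top_k experts, so inference cost is a fraction of the
    dense layer's cost (illustrative sketch, not the paper's code)."""

    def __init__(self, dim, num_experts, num_tasks, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        # Each "expert" is a small linear map; scaled for stable magnitudes.
        self.experts = [rng.standard_normal((dim, dim)) / np.sqrt(dim)
                        for _ in range(num_experts)]
        self.task_emb = rng.standard_normal((num_tasks, dim))
        self.gate = rng.standard_normal((dim, num_experts)) / np.sqrt(dim)
        self.top_k = top_k

    def forward(self, x, task_id):
        # Gate scores depend on the task embedding, not just the input token,
        # so different tasks route to different sparse expert subsets.
        logits = (x + self.task_emb[task_id]) @ self.gate
        top = np.argsort(logits)[-self.top_k:]   # indices of the active experts
        w = np.exp(logits[top] - logits[top].max())
        w = w / w.sum()                          # softmax over the top-k only
        # Only the selected expert pathways are evaluated (sparse activation);
        # the remaining experts contribute zero FLOPs.
        return sum(wi * (x @ self.experts[i]) for wi, i in zip(w, top))
```

At inference, switching the `task_id` changes which expert weights must be resident, which is exactly the memory-traffic pattern the paper's computation-reordering scheme targets on the FPGA side.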