Traditional multi-task learning (MTL) methods rely on dense networks that apply the same set of shared weights across several different tasks. This often creates interference, where two or more tasks compete to pull model parameters in different directions. In this work, we study whether sparsely activated Mixture-of-Experts (MoE) models improve multi-task learning by dedicating some weights to learning shared representations and the rest to learning task-specific information. To this end, we devise task-aware gating functions to route examples from different tasks to specialized experts that share subsets of network weights conditioned on the task. This results in a sparsely activated multi-task model with a large number of parameters, but with the same computational cost as that of a dense model. We demonstrate that such sparse networks improve multi-task learning along three key dimensions: (i) transfer to low-resource tasks from related tasks in the training mixture; (ii) sample-efficient generalization to tasks not seen during training by making use of task-aware routing from seen related tasks; (iii) robustness to the addition of unrelated tasks by avoiding catastrophic forgetting of existing tasks.
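To make the task-aware routing idea concrete, below is a minimal sketch of an MoE layer whose gate is conditioned on a learned task embedding, so examples from the same task are routed to the same small subset of experts. The class and parameter names (`TaskAwareMoE`, `num_experts`, `top_k`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a task-aware MoE layer (illustrative assumption,
# not the paper's actual architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskAwareMoE(nn.Module):
    def __init__(self, d_model, num_experts, num_tasks, top_k=1):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_experts)]
        )
        # Gating is conditioned on the task id via a learned task embedding,
        # so routing decisions depend on the task rather than the token alone.
        self.task_embedding = nn.Embedding(num_tasks, d_model)
        self.gate = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x, task_id):
        # x: (batch, d_model); task_id: (batch,) integer task indices
        gate_logits = self.gate(self.task_embedding(task_id))    # (batch, num_experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)  # sparse top-k routing
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run for each example, so compute stays
        # comparable to a dense layer even though total parameters grow.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out
```

Because the gate sees the task embedding, related tasks can learn to select overlapping expert subsets (sharing weights), while unrelated tasks can be routed to disjoint experts, which is the mechanism the transfer and robustness claims above rest on.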