Recent work suggests that transformer models are capable of multi-tasking on diverse NLP tasks and adapting to new tasks efficiently. However, the potential of these multi-task models may be limited because they use the same set of parameters for all tasks. In contrast, humans tackle tasks in a more flexible way, making assumptions about which skills and knowledge are relevant and executing only the necessary computations. Inspired by this, we propose to use task-level mixture-of-experts models, which have a collection of transformer layers (i.e., experts) and a router component that chooses among these experts dynamically and flexibly. We find that these models improve the average relative gain (ARG) metric by 2.6% when adapting to unseen tasks in the few-shot setting and by 5.6% in the zero-shot generalization setting. Further, we show that the learned routing decisions partly rediscover the human categorization of NLP tasks -- certain experts are strongly associated with extractive tasks, some with classification tasks, and some with tasks requiring world knowledge.
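To make the architecture concrete, below is a minimal sketch of what a task-level mixture-of-experts layer could look like, assuming a learned per-task embedding fed to a softmax router over expert feed-forward sublayers. All names (TaskMoELayer, num_tasks, etc.) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TaskMoELayer(nn.Module):
    """Illustrative task-level mixture-of-experts layer (hypothetical sketch).

    Unlike token-level MoE, the routing decision here is made once per task:
    a learned task embedding is scored against the experts, and the chosen
    expert processes every example belonging to that task.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int, num_tasks: int):
        super().__init__()
        # One feed-forward "expert" per slot, each a standard transformer FFN.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(num_experts)
        )
        # Learned embedding per task, projected to per-expert routing logits.
        self.task_embedding = nn.Embedding(num_tasks, d_model)
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, hidden: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model); task_id: scalar id of the task.
        logits = self.router(self.task_embedding(task_id))  # (num_experts,)
        weights = torch.softmax(logits, dim=-1)
        if self.training:
            # Soft routing during training: mix all experts by router weight.
            return sum(w * expert(hidden) for w, expert in zip(weights, self.experts))
        # Hard routing at inference: run only the top-1 expert for this task.
        return self.experts[int(weights.argmax())](hidden)

# Example usage with arbitrary illustrative sizes:
layer = TaskMoELayer(d_model=512, d_ff=2048, num_experts=8, num_tasks=160)
x = torch.randn(4, 128, 512)
y = layer(x, torch.tensor(3))  # every example of task 3 shares one routing decision
```

Because the routing decision is per task rather than per token, the expert assignments can be read off directly for each task, which is consistent with the abstract's analysis of how learned routing partly rediscovers human task categories.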