How can we measure the generalization of models to a variety of unseen tasks when provided with their language instructions? To facilitate progress toward this goal, we introduce Natural-Instructions v2, a benchmark of 1,600+ diverse language tasks and their expert-written instructions. It covers 70+ distinct task types, such as tagging, in-filling, and rewriting. These tasks were collected with contributions from NLP practitioners in the community and refined through an iterative peer-review process to ensure their quality. With this large and diverse collection of tasks, we can rigorously benchmark the cross-task generalization of models: training on a subset of tasks and evaluating on the remaining unseen ones. For instance, we quantify generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances per task, and model size. Based on these insights, we introduce Tk-Instruct, an encoder-decoder Transformer trained to follow a variety of in-context instructions (plain-language task definitions or k-shot examples), which outperforms existing larger models on our benchmark. We hope this benchmark facilitates future progress toward more general-purpose language understanding models.
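To make the evaluation setup concrete, the following is a minimal Python sketch of the two ideas the abstract describes: holding out entire tasks for cross-task generalization, and assembling an instruction-style input (task definition plus k in-context examples) for an encoder-decoder model. The `Task` fields, prompt layout, and function names are illustrative assumptions, not the benchmark's released schema or the authors' code.

```python
# A minimal sketch (assumed layout, not the official Natural-Instructions v2 code).
from dataclasses import dataclass, field
from typing import List, Tuple
import random

@dataclass
class Task:
    name: str                                                       # e.g. a task identifier
    definition: str                                                 # plain-language task definition
    examples: List[Tuple[str, str]] = field(default_factory=list)  # (input, output) demonstrations
    instances: List[Tuple[str, str]] = field(default_factory=list) # evaluation instances

def build_prompt(task: Task, instance_input: str, k: int = 2) -> str:
    """Concatenate the definition, k in-context examples, and the instance input."""
    parts = [f"Definition: {task.definition}"]
    for demo_in, demo_out in task.examples[:k]:
        parts.append(f"Input: {demo_in}\nOutput: {demo_out}")
    parts.append(f"Input: {instance_input}\nOutput:")
    return "\n\n".join(parts)

def cross_task_split(tasks: List[Task], eval_fraction: float = 0.1, seed: int = 0):
    """Hold out whole tasks (not instances), so evaluation tasks are fully unseen."""
    shuffled = tasks[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]   # (train_tasks, eval_tasks)

if __name__ == "__main__":
    toy = Task(
        name="toy_sentiment_tagging",
        definition="Label the sentiment of the sentence as Positive or Negative.",
        examples=[("I loved this movie.", "Positive"), ("The food was awful.", "Negative")],
        instances=[("The service was fantastic.", "Positive")],
    )
    print(build_prompt(toy, toy.instances[0][0], k=2))
```

Holding out tasks rather than instances is what distinguishes cross-task generalization from the usual in-task train/test split: the evaluated model never sees any instance of an evaluation task during training, only its instruction at inference time.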