How can we measure the generalization of models to a variety of unseen tasks when provided with their language instructions? To facilitate progress toward this goal, we introduce Natural-Instructions v2, a collection of 1,600+ diverse language tasks and their expert-written instructions. More importantly, the benchmark covers 70+ distinct task types, such as tagging, in-filling, and rewriting. The benchmark was collected with contributions from NLP practitioners in the community and through an iterative peer-review process to ensure its quality. It enables large-scale evaluation of cross-task generalization of models -- training on a subset of tasks and evaluating on the remaining unseen ones. For instance, we are able to rigorously quantify generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances per task, and model sizes. As a by-product of these experiments, we introduce Tk-Instruct, an encoder-decoder Transformer that is trained to follow a variety of in-context instructions (plain language task definitions or k-shot examples), which outperforms existing larger models on our benchmark. We hope this benchmark facilitates future progress toward more general-purpose language understanding models.