Large pretrained language models like GPT-3 have acquired a surprising ability to perform zero-shot classification (ZSC). For example, to classify review sentiments, we can "prompt" the language model with the review and the question "Is the review positive?" as the context, and ask it to predict whether the next word is "Yes" or "No". However, these models are not specialized for answering these prompts. To address this weakness, we propose meta-tuning, which trains the model to specialize in answering prompts but still generalize to unseen tasks. To create the training data, we aggregated 43 existing datasets, annotated 441 label descriptions in total, and unified them into the above question answering (QA) format. After meta-tuning, our model outperforms a same-sized QA model for most labels on unseen tasks, and we forecast that the performance would improve for even larger models. Therefore, measuring ZSC performance on non-specialized language models might underestimate their true capability, and community-wide efforts on aggregating datasets and unifying their formats can help build models that understand prompts better.
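The prompting setup described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the prompt template, the `NextWordScorer` interface, and `toy_scorer` (a keyword counter standing in for a real language model) are all assumptions made for the example. The only idea taken from the abstract is to place the review and a label-description question in the context, then compare the model's preference for "Yes" versus "No" as the next word.

```python
import math
from typing import Callable, Dict

# Hypothetical interface: maps a context string to unnormalized scores
# (logits) for candidate next words. A real system would query a language model.
NextWordScorer = Callable[[str], Dict[str, float]]


def build_prompt(text: str, question: str) -> str:
    """Combine the input text and a label-description question into one context.

    The exact template is an assumption; the abstract only says both are given
    as context.
    """
    return f"{text}\nQuestion: {question}\nAnswer:"


def prob_yes(scorer: NextWordScorer, text: str, question: str) -> float:
    """Probability of "Yes", renormalized over the two answers "Yes" and "No"."""
    scores = scorer(build_prompt(text, question))
    yes, no = scores["Yes"], scores["No"]
    m = max(yes, no)  # subtract the max for numerical stability
    e_yes, e_no = math.exp(yes - m), math.exp(no - m)
    return e_yes / (e_yes + e_no)


def classify(scorer: NextWordScorer, text: str, question: str,
             threshold: float = 0.5) -> bool:
    """Return True when the model prefers "Yes" for this label description."""
    return prob_yes(scorer, text, question) > threshold


def toy_scorer(context: str) -> Dict[str, float]:
    """Stand-in for a language model: counts a few sentiment words. Not a real model."""
    review = context.split("\nQuestion:")[0].lower()
    pos = sum(word in review for word in ("great", "love", "excellent"))
    neg = sum(word in review for word in ("bad", "boring", "terrible"))
    return {"Yes": float(pos), "No": float(neg)}


if __name__ == "__main__":
    question = "Is the review positive?"
    for review in ("I love this movie, it is great.", "A boring and terrible film."):
        print(review, "->", classify(toy_scorer, review, question))
```

Because every label is expressed as a yes/no question, a new task only needs a new question string, which is what allows the same model to be applied to unseen label descriptions.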