Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11. The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits at https://raft.elicit.org.