Most of the recent progress in natural language understanding (NLU) has been driven, in part, by benchmarks such as GLUE, SuperGLUE, and SQuAD. In fact, many NLU models have now matched or exceeded "human-level" performance on many tasks in these benchmarks. Most of these benchmarks, however, give models access to relatively large amounts of labeled data for training. As such, the models are provided far more data than humans require to achieve strong performance. This has motivated a line of work that focuses on improving the few-shot learning performance of NLU models. However, there is a lack of standardized evaluation benchmarks for few-shot NLU, resulting in different experimental settings in different papers. To help accelerate this line of work, we introduce CLUES (Constrained Language Understanding Evaluation Standard), a benchmark for evaluating the few-shot learning capabilities of NLU models. We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge gap in performance in the few-shot setting for most tasks. We also demonstrate differences between alternative model families and adaptation techniques in the few-shot setting. Finally, we discuss several principles and choices in designing the experimental settings for evaluating true few-shot learning performance and suggest a unified, standardized approach to few-shot learning evaluation. We aim to encourage research on NLU models that can generalize to new tasks with a small number of examples. Code and data for CLUES are available at https://github.com/microsoft/CLUES.