The few-shot natural language understanding (NLU) task has attracted much recent attention. However, prior methods have been evaluated under a disparate set of protocols, which hinders fair comparison and obscures the field's progress. To address this issue, we introduce an evaluation framework that improves previous evaluation procedures in three key aspects: test performance, dev-test correlation, and stability. Under this new evaluation framework, we re-evaluate several state-of-the-art few-shot methods for NLU tasks. Our framework reveals new insights: (1) both the absolute performance and the relative gaps between methods were not accurately estimated in prior literature; (2) no single method dominates most tasks with consistent performance; (3) the improvements of some methods diminish with larger pretrained models; and (4) gains from different methods are often complementary, and the best combined model performs close to a strong fully-supervised baseline. We open-source our toolkit, FewNLU, which implements our evaluation framework along with a number of state-of-the-art methods.
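To make the three evaluation criteria concrete, the sketch below shows how they might be computed from per-run scores. This is an illustrative assumption, not code from the FewNLU toolkit: the dev/test numbers are hypothetical placeholders standing in for results obtained under different data splits or random seeds.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-run scores (placeholders, not results from the paper);
# each run corresponds to a different data split / random seed.
dev_scores = np.array([0.712, 0.695, 0.731, 0.704])
test_scores = np.array([0.688, 0.673, 0.702, 0.679])

# (1) Test performance: mean test score over runs.
mean_test = test_scores.mean()

# (2) Dev-test correlation: whether model selection on the dev set
#     is predictive of test-set behavior.
corr, _ = spearmanr(dev_scores, test_scores)

# (3) Stability: standard deviation of test performance across runs.
std_test = test_scores.std(ddof=1)

print(f"mean test = {mean_test:.3f}, dev-test corr = {corr:.3f}, std = {std_test:.3f}")
```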