Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair

More capable language models increasingly saturate existing task benchmarks, in some cases outperforming humans. This has left little headroom with which to measure further progress. Adversarial dataset creation has been proposed as a strategy to construct more challenging datasets, and two common approaches are: (1) filtering out easy examples and (2) model-in-the-loop data collection. In this work, we study the impact of applying each approach to create more challenging evaluation datasets. We adapt the AFLite algorithm to filter evaluation data, and run experiments against 18 different adversary models. We find that AFLite indeed selects more challenging examples, lowering the performance of evaluated models more as stronger adversary models are used. However, the resulting ranking of models can also be unstable and highly sensitive to the choice of adversary model used. Moreover, AFLite oversamples examples with low annotator agreement, meaning that model comparisons hinge on the most contentiously labeled examples. Smaller-scale experiments on the adversarially collected datasets ANLI and AdversarialQA show similar findings, broadly lowering performance with stronger adversaries while disproportionately affecting the adversary model.
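The filtering approach mentioned in the abstract can be illustrated with a rough sketch. The abstract does not spell out the procedure, so what follows is only a simplified illustration in the spirit of AFLite, not the authors' adaptation: repeatedly train weak classifiers on random splits of the data, score each example by how often it is predicted correctly when held out, and drop the most predictable examples. The nearest-centroid classifier is a stand-in for the linear classifiers trained on an adversary model's representations, and every name and parameter here (`filter_easy_examples`, `rounds`, `slice_size`, `threshold`, the toy data) is hypothetical.

```python
import random
from collections import defaultdict


def centroid_classifier(train):
    """Fit a nearest-centroid classifier (stand-in for a linear adversary)."""
    sums, counts = {}, defaultdict(int)
    for features, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(features):
        # Return the label whose centroid is closest in squared distance.
        return min(
            centroids,
            key=lambda y: sum((f - c) ** 2 for f, c in zip(features, centroids[y])),
        )

    return predict


def filter_easy_examples(data, target_size, rounds=20, train_frac=0.5,
                         slice_size=2, threshold=0.75, seed=0):
    """Return indices of examples kept after dropping the most predictable ones.

    data: list of (feature_vector, label) pairs.
    """
    rng = random.Random(seed)
    kept = list(range(len(data)))
    while len(kept) > target_size:
        correct, seen = defaultdict(int), defaultdict(int)
        for _ in range(rounds):
            shuffled = kept[:]
            rng.shuffle(shuffled)
            cut = max(1, int(len(shuffled) * train_frac))
            train_ids, held_out = shuffled[:cut], shuffled[cut:]
            predict = centroid_classifier([data[i] for i in train_ids])
            for i in held_out:
                seen[i] += 1
                correct[i] += predict(data[i][0]) == data[i][1]
        # Predictability score: fraction of held-out rounds predicted correctly.
        scores = {i: correct[i] / seen[i] for i in kept if seen[i]}
        easy = sorted((i for i in scores if scores[i] >= threshold),
                      key=lambda i: scores[i], reverse=True)
        budget = min(slice_size, len(kept) - target_size)
        to_drop = set(easy[:budget])
        if not to_drop:
            break  # nothing left that is predictable enough to remove
        kept = [i for i in kept if i not in to_drop]
    return kept


# Toy data: two well-separated clusters plus two examples labeled against
# their cluster, which a weak classifier tends to get wrong.
toy_data = (
    [([0.0 + 0.1 * i, 0.0], 0) for i in range(6)]
    + [([10.0 + 0.1 * i, 10.0], 1) for i in range(6)]
    + [([0.3, 0.1], 1), ([10.2, 9.9], 0)]
)
kept_indices = filter_easy_examples(toy_data, target_size=6)
print(kept_indices)
```

In this sketch the classifier plays the role of the adversary. The abstract reports that model rankings on the filtered set are sensitive to the choice of adversary, so one could swap in a different classifier and compare which examples survive.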

