More capable language models increasingly saturate existing task benchmarks, in some cases outperforming humans, leaving little headroom with which to measure further progress. Adversarial dataset creation has been proposed as a strategy for constructing more challenging datasets, and two common approaches are: (1) filtering out easy examples and (2) model-in-the-loop data collection. In this work, we study the impact of applying each approach to create more challenging evaluation datasets. We adapt the AFLite algorithm to filter evaluation data and run experiments against 18 different adversary models. We find that AFLite indeed selects more challenging examples: the stronger the adversary model used for filtering, the more the performance of evaluated models drops. However, the resulting ranking of models can also be unstable and highly sensitive to the choice of adversary model. Moreover, AFLite oversamples examples with low annotator agreement, meaning that model comparisons hinge on the most contentiously labeled examples. Smaller-scale experiments on the adversarially collected datasets ANLI and AdversarialQA show similar findings: stronger adversaries broadly lower performance, while disproportionately affecting the adversary model itself.