Several benchmarks have been built with heavy investment of resources to track our progress in NLP. Thousands of papers published in response to those benchmarks have competed to top leaderboards, with models often surpassing human performance. However, recent studies have shown that models can triumph over several popular benchmarks simply by overfitting to spurious biases, without truly learning the desired task. Despite this finding, benchmarking efforts that try to tackle bias still rely on workarounds that do not fully utilize the resources invested in benchmark creation, since they discard low-quality data, and they cover only limited types of bias. A potential solution to these issues, a metric quantifying benchmark quality, remains underexplored. Inspired by successful quality indices in domains such as power, food, and water, we take the first step towards such a metric by identifying language properties that can represent the various possible interactions leading to biases in a benchmark. We look for bias-related parameters that can potentially help pave the way towards the metric. We survey existing work and identify parameters capturing various properties of bias: their origins, their types, and their impact on performance, generalization, and robustness. Our analysis spans many datasets and a hierarchy of tasks ranging from NLI to summarization, ensuring that our parameters are generic and not overfitted to a specific task or dataset. We also develop new parameters in the process.
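To make the notion of "overfitting to spurious biases" concrete, below is a minimal sketch of a partial-input (hypothesis-only) probe of the kind used to expose such biases in NLI benchmarks. The toy premise/hypothesis pairs, labels, and the `probe` pipeline are all hypothetical illustrations, not artifacts from this survey; the idea is simply that if a classifier that never sees the premise beats a majority-class baseline, the labels leak through surface cues rather than genuine inference.

```python
# Minimal sketch of a "hypothesis-only" bias probe for an NLI benchmark.
# All examples below are hypothetical placeholders, not drawn from any real dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy NLI examples: (premise, hypothesis, label).
train = [
    ("A man is playing guitar", "A person makes music", "entailment"),
    ("A dog runs in the park", "Nobody is outside", "contradiction"),
    ("A woman reads a book", "A woman is reading", "entailment"),
    ("Kids play soccer", "The children are not sleeping", "contradiction"),
]
test = [
    ("A chef cooks pasta", "Someone prepares food", "entailment"),
    ("A cat sleeps on a couch", "No animal is resting", "contradiction"),
]

# Train on the hypothesis alone, deliberately ignoring the premise.
# Above-baseline accuracy here signals a spurious cue (e.g., negation
# words correlating with "contradiction") rather than learned inference.
X_train = [h for _, h, _ in train]
y_train = [y for _, _, y in train]
X_test = [h for _, h, _ in test]
y_test = [y for _, _, y in test]

probe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
probe.fit(X_train, y_train)
print("hypothesis-only accuracy:", probe.score(X_test, y_test))
```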