Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly simplifies the complexity of language and encourages overfitting to the head of the data distribution. As a result, rare language phenomena and text about underrepresented groups are not equally represented in the evaluation. To encourage more in-depth model analyses, researchers have proposed the use of multiple test sets, also called challenge sets, that assess specific capabilities of a model. In this paper, we develop a framework based on this idea that can generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings. By applying this framework to the GEM generation benchmark, we propose an evaluation suite made of 80 challenge sets, demonstrate the kinds of analyses that it enables, and shed light on the limits of current generation models.
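To make the two kinds of challenge sets concrete, the sketch below illustrates, under assumed data and function names (perturb_whitespace, split_by_input_length, and the "input"/"target" fields are hypothetical and not part of the framework's actual API), what a controlled perturbation and a subset split might look like for a text-to-text task: the perturbation alters only the model input while keeping the reference output fixed, and the subset split groups test examples by a measurable property so they can be scored separately.

```python
# Illustrative sketch only; names and fields are assumptions, not the framework's API.
import random
from typing import Dict, List


def perturb_whitespace(example: Dict[str, str], seed: int = 0) -> Dict[str, str]:
    """Controlled perturbation: insert extra spaces into the input text.

    The reference output is left untouched, so any drop in a model's score
    can be attributed to the perturbation alone.
    """
    rng = random.Random(seed)
    tokens = example["input"].split(" ")
    noisy = " ".join(t + (" " if rng.random() < 0.2 else "") for t in tokens)
    return {**example, "input": noisy}


def split_by_input_length(examples: List[Dict[str, str]], threshold: int = 20):
    """Subset identification: separate short from long inputs for per-subset scoring."""
    short = [ex for ex in examples if len(ex["input"].split()) <= threshold]
    long_ = [ex for ex in examples if len(ex["input"].split()) > threshold]
    return {"short_inputs": short, "long_inputs": long_}


if __name__ == "__main__":
    data = [{"input": "A short source sentence.", "target": "A short reference."}]
    challenge_set = [perturb_whitespace(ex, seed=i) for i, ex in enumerate(data)]
    subsets = split_by_input_length(data)
    print(challenge_set, {k: len(v) for k, v in subsets.items()})
```

Evaluating a model on the perturbed copy and on each subset separately, rather than on a single aggregate test set, is what surfaces the capability-specific weaknesses that a single score would hide.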