In NLP, models are usually evaluated by reporting single-number performance scores on a number of readily available benchmarks, without much deeper analysis. Here, we argue that — especially given the well-known fact that benchmarks often contain biases, artefacts, and spurious correlations — deeper analysis of results should become the de facto standard when presenting new models or benchmarks. We present a tool that researchers can use to study properties of a dataset and the influence of those properties on their models' behaviour. Our Text Characterization Toolkit includes both an easy-to-use annotation tool and off-the-shelf scripts for specific analyses. We also present use cases from three different domains: we use the tool to predict which examples are difficult for given well-known trained models, and to identify (potentially harmful) biases and heuristics present in a dataset.
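To make the kind of analysis described above concrete, here is a minimal, self-contained sketch of correlating a dataset property with per-example model correctness. All names and the toy data are illustrative assumptions, not the toolkit's actual API: we compute a trivial text characteristic (token count) and its Pearson correlation with whether a hypothetical model answered each example correctly.

```python
# Hypothetical sketch: correlate a text characteristic with model correctness.
# The characteristic, data, and function names are illustrative, not the
# Text Characterization Toolkit's API.
from statistics import mean


def token_count(text: str) -> int:
    """A trivial text characteristic: whitespace-token count."""
    return len(text.split())


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)


# Toy dataset: (example text, 1 if the model answered correctly else 0).
examples = [
    ("short question", 1),
    ("a somewhat longer question with more clauses", 0),
    ("tiny", 1),
    ("an even longer example sentence that keeps adding qualifying clauses", 0),
    ("medium length input here", 1),
]

lengths = [token_count(text) for text, _ in examples]
correct = [float(c) for _, c in examples]
r = pearson(lengths, correct)
# A strongly negative r would suggest longer examples are harder for this model.
```

In practice one would substitute real characteristics (e.g. word rarity, readability scores) and real per-example predictions, but the shape of the analysis — characteristic vs. outcome correlation — stays the same.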