Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty -- w.r.t. a model $\mathcal{V}$ -- as the lack of $\mathcal{V}$-\textit{usable information} (Xu et al., 2019), where a lower value indicates a more difficult dataset for $\mathcal{V}$. We further introduce \textit{pointwise $\mathcal{V}$-information} (PVI) for measuring the difficulty of individual instances w.r.t. a given distribution. While standard evaluation metrics typically only compare different models for the same dataset, $\mathcal{V}$-\textit{usable information} and PVI also permit the converse: for a given model $\mathcal{V}$, we can compare different datasets, as well as different instances/slices of the same dataset. Furthermore, our framework allows for the interpretability of different input attributes via transformations of the input, which we use to discover annotation artefacts in widely-used NLP benchmarks.
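For concreteness, a sketch of the quantities referenced above, in the notation of Xu et al. (2019); here $f[x](y)$ denotes the probability a model $f \in \mathcal{V}$ assigns to label $y$ given input $x$, $\varnothing$ is a null (empty) input, and $g$, $g'$ are assumed to be models fit on the data with and without the input, respectively:

$$H_{\mathcal{V}}(Y) = \inf_{f \in \mathcal{V}} \mathbb{E}\big[-\log_2 f[\varnothing](Y)\big], \qquad H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}\big[-\log_2 f[X](Y)\big]$$

$$I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X), \qquad \mathrm{PVI}(x \to y) = -\log_2 g'[\varnothing](y) + \log_2 g[x](y)$$

Under this framing, averaging PVI over held-out instances yields an estimate of $I_{\mathcal{V}}(X \to Y)$, so a lower average indicates a harder dataset for $\mathcal{V}$, and individual low-PVI instances are the ones $\mathcal{V}$ finds difficult.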