In this paper, we ask whether all datasets in a benchmark are necessary. We approach this question by first characterizing how well datasets distinguish between systems. Experiments on 9 datasets and 36 systems show that several widely used benchmark datasets contribute little to discriminating among top-scoring systems, while some less frequently used datasets exhibit impressive discriminative power. Taking the text classification task as a case study, we further investigate whether a dataset's discriminative power can be predicted from its properties (e.g., average sentence length). Our preliminary experiments promisingly show that, given a sufficient number of training experimental records, a meaningful predictor can be learned to estimate the discriminative power of unseen datasets. We release all datasets, together with the features explored in this work, on DataLab: \url{https://datalab.nlpedia.ai}.