Datasets serve as crucial training resources and benchmarks of model performance. However, existing datasets have exposed a plethora of problems, inducing biased models and unreliable evaluation results. In this paper, we propose a model-agnostic framework for automatic dataset quality evaluation. We examine the statistical properties of datasets and address three fundamental dimensions, reliability, difficulty, and validity, following classical testing theory. Taking Named Entity Recognition (NER) datasets as a case study, we introduce $9$ statistical metrics for a statistical dataset evaluation framework. Experimental results and human evaluation validate that our framework effectively assesses various aspects of dataset quality. Furthermore, we study how dataset scores on our statistical metrics affect model performance, and call for dataset quality evaluation or targeted dataset improvement before training or testing models.
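To make the flavor of such statistical metrics concrete, the sketch below computes one illustrative quantity on a BIO-tagged NER corpus: the Shannon entropy of the entity-type distribution, a simple proxy for label balance. This is a minimal example of the kind of model-agnostic statistic the framework relies on; the function name `label_entropy` and the BIO input format are assumptions for illustration, and the metric shown is not claimed to be one of the paper's $9$.

```python
from collections import Counter
import math

def label_entropy(bio_tagged_sentences):
    """Shannon entropy (in bits) of the entity-type distribution.

    A flatter (higher-entropy) distribution suggests more balanced
    entity types; a skewed one hints at label imbalance. Illustrative
    only; not necessarily one of the paper's nine metrics.
    """
    # Count one entity mention per B- tag, keyed by entity type.
    types = Counter(
        tag[2:]
        for sentence in bio_tagged_sentences
        for tag in sentence
        if tag.startswith("B-")
    )
    total = sum(types.values())
    return -sum((c / total) * math.log2(c / total) for c in types.values())

# Toy corpus: two sentences in BIO format.
corpus = [
    ["B-PER", "I-PER", "O", "B-LOC"],
    ["O", "B-ORG", "O", "B-PER", "I-PER"],
]
print(f"Entity-type entropy: {label_entropy(corpus):.3f} bits")
```

Because such a statistic depends only on the annotations themselves, it can be computed before any model is trained, which is what makes the evaluation model-agnostic.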