Evaluation of models on benchmarks is unreliable without knowing the degree of sample hardness; this leads to overestimating the capability of AI systems and limits their adoption in real-world applications. We propose a Data Scoring task that requires assigning each unannotated sample in a benchmark a score between 0 and 1, where 0 signifies easy and 1 signifies hard. The use of unannotated samples in our task design is inspired by humans, who can judge a question's difficulty without knowing its correct answer. It also rules out methods involving model-based supervision (since such methods require sample annotations for training), eliminating potential biases that models introduce when deciding sample difficulty. We propose a method based on Semantic Textual Similarity (STS) for this task, and validate it by showing that existing models are more accurate on the easier sample-chunks than on the harder ones. Finally, we demonstrate five novel applications.
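To make the STS-based scoring idea concrete, the sketch below shows one way a hardness score in [0, 1] could be computed for an unannotated sample without any gold labels. It is a minimal illustration, not the paper's actual procedure: the choice of encoder (sentence-transformers with all-MiniLM-L6-v2), the pairing of question and context, and the mapping from similarity to hardness are all assumptions made here for exposition.

```python
# Illustrative sketch only: derive a hardness score from Semantic Textual
# Similarity (STS) between a sample's question and its context. The field
# pairing and the similarity-to-hardness mapping are assumptions, not the
# method proposed in the paper.
from sentence_transformers import SentenceTransformer, util  # assumed dependency

model = SentenceTransformer("all-MiniLM-L6-v2")  # any STS-capable encoder works


def hardness_score(question: str, context: str) -> float:
    """Map STS between question and context to a score in [0, 1].

    Assumed intuition: the less a question resembles the text it must be
    answered from, the harder the sample; no gold answer is required.
    """
    q_emb, c_emb = model.encode([question, context], convert_to_tensor=True)
    similarity = util.cos_sim(q_emb, c_emb).item()   # roughly in [-1, 1]
    similarity = max(0.0, min(1.0, similarity))      # clamp to [0, 1]
    return 1.0 - similarity                          # low similarity -> hard (≈1)


# Example usage on a single unannotated sample (hypothetical data)
sample = {
    "question": "What year did the treaty enter into force?",
    "context": "The agreement, signed in Vienna, became effective in 1997.",
}
print(f"hardness ≈ {hardness_score(sample['question'], sample['context']):.2f}")
```

Because the score depends only on the sample text itself, such a scorer needs no trained supervision model, which matches the task constraint that sample annotations are unavailable.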