Benchmarks are pivotal in driving AI progress, yet invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is infeasible, making invalid questions a critical bottleneck for reliable evaluation. In this work, we introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions for further expert review. Our approach builds on a core assumption commonly made in AI evaluations: that the mean score sufficiently summarizes model performance. This assumption implies a unidimensional latent construct underlying the measurement experiment, which in turn yields expected ranges for various item-level statistics. When the empirically estimated values of these statistics fall outside the expected range for an item, that item is more likely to be problematic. Across nine widely used benchmarks, our method guides expert review to identify problematic questions with up to 84\% precision. In addition, we introduce an LLM-judge first pass to review questions, further reducing human effort. Together, these components provide an efficient and scalable framework for systematic benchmark revision.
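To make the flagging idea concrete, the sketch below illustrates one way an item-level statistic could be screened against its expected range under a unidimensional assumption. It uses the corrected item-total correlation as the statistic and a simple z-score cutoff as the expected range; the function name \texttt{flag\_items}, the choice of statistic, the threshold, and the synthetic data are illustrative assumptions, not the paper's actual procedure.
\begin{verbatim}
import numpy as np

def flag_items(responses, z_thresh=2.0):
    """Flag items whose discrimination deviates from the other items.

    responses: binary matrix of shape (n_models, n_items), 1 = correct.
    Returns indices of items whose corrected item-total correlation is an
    outlier, a rough proxy for misfit under a unidimensional assumption.
    """
    responses = np.asarray(responses, dtype=float)
    n_models, n_items = responses.shape
    total = responses.sum(axis=1)

    # Corrected item-total correlation: correlate each item with the total
    # score of the *remaining* items, so the item does not inflate its own
    # correlation.
    disc = np.empty(n_items)
    for j in range(n_items):
        item = responses[:, j]
        rest = total - item
        if item.std() == 0 or rest.std() == 0:
            disc[j] = 0.0  # degenerate item (all models right or all wrong)
        else:
            disc[j] = np.corrcoef(item, rest)[0, 1]

    # Items whose discrimination is far below the typical value are
    # candidates for expert review (e.g., a mislabeled answer key often
    # yields a near-zero or negative correlation with overall ability).
    z = (disc - disc.mean()) / disc.std()
    return np.where(z < -z_thresh)[0]

# Synthetic example: 50 models answering 200 items, with the first five
# items made independent of model ability (i.e., "broken" items).
rng = np.random.default_rng(0)
ability = rng.normal(size=(50, 1))
difficulty = rng.normal(size=(1, 200))
p = 1 / (1 + np.exp(-(ability - difficulty)))
data = (rng.random((50, 200)) < p).astype(int)
data[:, :5] = rng.integers(0, 2, size=(50, 5))
print(flag_items(data))  # most of the injected broken items are flagged
\end{verbatim}
In practice, any statistic with a known expected range under the unidimensional model could be substituted here; the screening step only ranks items for human (or LLM-judge) review, it does not decide validity on its own.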