Overestimation in the evaluation of large language models (LLMs) has become an increasing concern. Owing to contamination of public benchmarks or imbalanced model training, LLMs may, intentionally or unintentionally, obtain inflated results on public benchmarks, which leads to unfair comparisons among LLMs and undermines realistic assessments of their capabilities. Existing benchmarks attempt to address these issues by keeping test cases permanently secret, by mitigating contamination through human evaluation, or by repeatedly collecting and constructing new samples. However, these approaches fail to ensure reproducibility, transparency, and high efficiency simultaneously. Moreover, the extent of overestimation in current LLMs remains unquantified. To address these issues, we propose ArxivRoll, a dynamic evaluation framework inspired by the one-time pad in cryptography. ArxivRoll comprises two key components: \emph{i) SCP (Sequencing, Cloze, and Prediction)}, an automated generator of private test cases, and \emph{ii) Rugged Scores (RS)}, metrics that quantify the degree of public-benchmark contamination and training bias. Leveraging SCP, ArxivRoll constructs a new benchmark every six months from recent ArXiv articles and uses it for a one-time evaluation of LLM performance. Extensive experiments demonstrate the high quality of our benchmark, and we provide a systematic evaluation of current LLMs. The source code is available at https://github.com/liangzid/ArxivRoll/.
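To make the SCP idea concrete, the sketch below is a minimal Python illustration, not the ArxivRoll implementation: the function name make_scp_items and the item format it emits are assumptions. It shows how sequencing, cloze, and prediction items could be derived from the ordered sentences of a recent ArXiv article.

```python
import random

def make_scp_items(sentences, seed=0):
    """Hypothetical sketch of SCP-style test-case generation from one article.

    Given the ordered sentences of a recent ArXiv article, build three kinds of
    private test items: Sequencing (restore the order of shuffled sentences),
    Cloze (recover a masked sentence), and Prediction (continue the passage).
    Illustrative only; not the ArxivRoll implementation.
    """
    rng = random.Random(seed)
    assert len(sentences) >= 4, "need a few sentences to build all three tasks"

    # Sequencing: shuffle a window of sentences; the answer records where each
    # shown sentence originally appeared.
    window = sentences[:4]
    order = list(range(len(window)))
    rng.shuffle(order)
    sequencing = {
        "task": "sequencing",
        "input": [window[i] for i in order],
        "answer": order,  # original position of each shown sentence
    }

    # Cloze: mask one interior sentence; the answer is the masked sentence.
    mask_idx = rng.randrange(1, len(sentences) - 1)
    cloze = {
        "task": "cloze",
        "input": sentences[:mask_idx] + ["[MASK]"] + sentences[mask_idx + 1:],
        "answer": sentences[mask_idx],
    }

    # Prediction: show a prefix; the answer is the next sentence.
    cut = len(sentences) // 2
    prediction = {
        "task": "prediction",
        "input": sentences[:cut],
        "answer": sentences[cut],
    }

    return [sequencing, cloze, prediction]
```

Because the items are generated on the fly from articles published within the last six months, each benchmark instance can be used once and then discarded, mirroring the one-time-pad analogy in the abstract.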