Overestimation in the evaluation of large language models (LLMs) has become an increasing concern. Owing to contamination of public benchmarks or imbalanced model training, LLMs may, intentionally or unintentionally, obtain inflated results on public benchmarks, which leads to unfair comparisons among LLMs and undermines realistic assessments of their capabilities. Existing benchmarks attempt to address these issues by keeping test cases permanently secret, by mitigating contamination through human evaluation, or by repeatedly collecting and constructing new samples. However, these approaches fail to ensure reproducibility, transparency, and high efficiency simultaneously. Moreover, the extent of overestimation in current LLMs remains unquantified. To address these issues, we propose ArxivRoll, a dynamic evaluation framework inspired by the one-time pad in cryptography. ArxivRoll comprises two key components: \emph{i) SCP (Sequencing, Cloze, and Prediction)}, an automated generator of private test cases, and \emph{ii) Rugged Scores (RS)}, metrics that quantify the degree of public-benchmark contamination and training bias. Leveraging SCP, ArxivRoll constructs a new benchmark every six months from recent ArXiv articles and uses it for a one-time evaluation of LLM performance. Extensive experiments demonstrate the high quality of our benchmark, and we provide a systematic evaluation of current LLMs. The source code is available at https://github.com/liangzid/ArxivRoll/.
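To make the SCP idea concrete, the sketch below is a minimal Python illustration, not the ArxivRoll implementation: the function name make_scp_items and the item format it emits are assumptions. It shows how sequencing, cloze, and prediction items could be derived from the ordered sentences of a recent ArXiv article.

```python
import random

def make_scp_items(sentences, seed=0):
    """Hypothetical sketch of SCP-style test-case generation from one article.

    Given the ordered sentences of a recent ArXiv article, build three kinds of
    private test items: Sequencing (restore the order of shuffled sentences),
    Cloze (recover a masked sentence), and Prediction (continue the passage).
    Illustrative only; not the ArxivRoll implementation.
    """
    rng = random.Random(seed)
    assert len(sentences) >= 4, "need a few sentences to build all three tasks"

    # Sequencing: shuffle a window of sentences; the answer records where each
    # shown sentence originally appeared.
    window = sentences[:4]
    order = list(range(len(window)))
    rng.shuffle(order)
    sequencing = {
        "task": "sequencing",
        "input": [window[i] for i in order],
        "answer": order,  # original position of each shown sentence
    }

    # Cloze: mask one interior sentence; the answer is the masked sentence.
    mask_idx = rng.randrange(1, len(sentences) - 1)
    cloze = {
        "task": "cloze",
        "input": sentences[:mask_idx] + ["[MASK]"] + sentences[mask_idx + 1:],
        "answer": sentences[mask_idx],
    }

    # Prediction: show a prefix; the answer is the next sentence.
    cut = len(sentences) // 2
    prediction = {
        "task": "prediction",
        "input": sentences[:cut],
        "answer": sentences[cut],
    }

    return [sequencing, cloze, prediction]
```

Because the items are generated on the fly from articles published within the last six months, each benchmark instance can be used once and then discarded, mirroring the one-time-pad analogy in the abstract.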