Mapping Overlaps in Benchmarks through Perplexity in the Wild

We develop signatures of capacity familiarity to characterize large language model (LLM) benchmarks and their meaningful overlaps. Benchmark signatures probe the capacity required for benchmark performance. We formally define them as a set of salient tokens drawn from in-the-wild, naturally authored corpora, where LLM token perplexity, reflecting more or less pre-training exposure, becomes highly predictive of LLM benchmark performance. Through a large-scale meta-evaluation, we extract benchmark signatures via stepwise forward selection with linear regressions across 32 LLMs and 88 benchmarks spanning diverse knowledge, coding, logic, instruction following, math, language, reasoning, and world modeling. Our analysis situates signatures in relation to both the semantic similarity of benchmark questions and the correlation of model performance. While performance overlaps are universally high and semantic overlaps remain confined to a narrow mid-range, benchmark signatures prove highly informative in capturing variation, overlap, and divergence. We observe overlap in knowledge and reasoning subtasks, whereas multilingual and cultural benchmarks exhibit less similarity, even compared to cross-task overlap. Notably, performance-level results are strongly influenced by benchmark-orthogonal factors such as question format, highlighting limitations in LLM generalization, the conflation of performance with ability, and issues inherent in current mainstream benchmark agreement studies. Benchmark signatures, however, remain robust to such effects. Ultimately, we identify cross-functional overlaps across logic, math, language, instruction following, and world modeling, with coding emerging as the least overlapping domain. Together, these findings provide mechanistic insights into benchmark validity and LLM sensitivities, and sketch the underlying landscape of interconnected LLM capabilities.
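The extraction step described above (stepwise forward selection with linear regressions) can be illustrated with a minimal sketch. The abstract does not specify the feature scaling, the stopping criterion, or the selection score, so everything below is an assumption made for illustration: the function name `forward_select`, the residual-sum-of-squares criterion, the fixed signature size `k`, and the synthetic data standing in for per-model token perplexities and benchmark scores.

```python
import numpy as np


def forward_select(X, y, k):
    """Greedy forward selection with ordinary least squares.

    X: (n_models, n_tokens) array of per-model token perplexity features
       (hypothetical layout; the abstract does not describe it).
    y: (n_models,) array of benchmark scores.
    k: number of tokens to keep (assumed fixed here for simplicity).
    Returns the selected column indices in the order they were added.
    """
    n, p = X.shape
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            # Fit intercept + already-selected tokens + candidate token j.
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ coef) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected


# Synthetic stand-in: 32 "models", 20 candidate "tokens".
# Scores depend only on tokens 3 and 7 by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 20))
y = 3.0 * X[:, 3] - 1.0 * X[:, 7] + 0.01 * rng.normal(size=32)

signature = forward_select(X, y, k=2)
print(signature)
```

In this toy setting the selected indices play the role of a benchmark signature: the small set of tokens whose perplexity, across models, best predicts the benchmark score. A realistic pipeline would also need a held-out criterion to decide when to stop adding tokens, which this sketch leaves out.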
