Large language models (LLMs) have made rapid progress in formal theorem proving, yet current benchmarks under-measure the kind of abstraction and library-mediated reasoning that organizes modern mathematics. In parallel with FATE's emphasis on frontier algebra, we introduce LeanCat, a Lean benchmark for category-theoretic formalization. Because category theory is a unifying language for mathematical structure and a core layer of modern proof engineering, LeanCat serves as a stress test of structural, interface-level reasoning. Part I: 1-Categories contains 100 fully formalized statement-level tasks, curated into topic families and three difficulty tiers through a grading process that combines LLM assistance with human review. The best model solves 8.25% of tasks at pass@1 (32.50%/4.17%/0.00% across the Easy/Medium/High tiers) and 12.00% at pass@4 (50.00%/4.76%/0.00%). We also evaluate LeanBridge, which uses LeanExplore to search Mathlib, and observe consistent gains over single-model baselines. LeanCat is intended as a compact, reusable checkpoint for tracking both AI and human progress toward reliable, research-level formalization in Lean.
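Assuming pass@k follows the conventional unbiased estimator used in code-generation benchmarks (the abstract itself does not fix a definition), the headline numbers correspond to sampling n candidate proofs per task, counting the c that verify, and averaging

\[ \operatorname{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \binom{n-c}{k} \Big/ \binom{n}{k} \,\right] \]

over all tasks.

To make "statement-level task" concrete, the following is a minimal Lean 4 sketch of the intended format: a fully formalized Mathlib-style statement that the model must close with a verified proof. The particular statement (composites of monomorphisms are monomorphisms) and its solution are illustrative only and are not drawn from LeanCat.

    import Mathlib

    open CategoryTheory

    -- Illustrative task: prove a Mathlib-style category-theoretic statement.
    -- The model receives the `example` header and must supply the proof.
    example {C : Type*} [Category C] {X Y Z : C} (f : X ⟶ Y) (g : Y ⟶ Z)
        [Mono f] [Mono g] : Mono (f ≫ g) := by
      constructor
      intro W h k hw
      -- reassociate the composite, then cancel g and f via the mono hypotheses
      have h₁ : (h ≫ f) ≫ g = (k ≫ f) ≫ g := by
        simpa [Category.assoc] using hw
      exact (cancel_mono f).mp ((cancel_mono g).mp h₁)

Even this toy item exercises the interface-level skills the benchmark targets: locating the right Mathlib lemmas (here `cancel_mono`) and reasoning through typeclass-mediated structure rather than concrete objects.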