Large language models (LLMs) show strong performance across natural language processing (NLP), mathematical reasoning, and programming, and recent large reasoning models (LRMs) further emphasize explicit reasoning. Yet their computational limits, particularly the space complexity imposed by finite context windows, remain poorly understood. While recent work has largely focused on problems within the NP complexity class, we push the boundary by introducing a benchmark grounded in two PSPACE-complete regular expression (regex) problems: equivalence decision (RegexEQ) and minimization (RegexMin). PSPACE-complete problems provide a more rigorous standard for assessing computational capacity, since solving them requires exploring an enormous search space. To build the benchmark, we perform a double-exponential space exploration and apply a sound filtering process, yielding a labeled dataset of over a million regex instances. We conduct extensive evaluations of 6 LLMs and 5 LRMs of varying scales, revealing common failure patterns such as verbosity and repetition. With its well-defined structure and quantitative evaluation metrics, this work presents the first empirical investigation into the space-complexity limitations of LLMs and LRMs, offering a new framework for evaluating their advanced reasoning capabilities. Our code is available at https://github.com/hyundong98/RegexPSPACE.
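To make the RegexEQ task concrete, below is a minimal Python sketch of a bounded equivalence check; the function name `agree_up_to`, the alphabet, and the length bound are illustrative choices of ours, not part of the benchmark code. It compares two regexes on every string up to a fixed length, which can refute equivalence but only suggest it; an actual RegexEQ decider must reason over the corresponding automata, whose worst-case blow-up is what makes the problem PSPACE-complete.

```python
import re
from itertools import product

def agree_up_to(r1: str, r2: str, alphabet: str = "ab", max_len: int = 6) -> bool:
    """Bounded check for RegexEQ: do r1 and r2 accept exactly the same
    strings of length <= max_len over the given alphabet?

    This is only a heuristic filter, not a decision procedure: a true
    decider must compare the underlying automata, and the worst-case
    cost of that comparison is what places RegexEQ in PSPACE.
    """
    p1, p2 = re.compile(r1), re.compile(r2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            w = "".join(chars)
            if bool(p1.fullmatch(w)) != bool(p2.fullmatch(w)):
                return False  # found a separating string
    return True

# (a|b)* and (a*b*)* denote the same language over {a, b} ...
assert agree_up_to(r"(a|b)*", r"(a*b*)*")
# ... whereas a(a|b)* rejects the empty string, so the two differ.
assert not agree_up_to(r"a(a|b)*", r"(a|b)*")
```

RegexMin can be framed on top of the same notion of equivalence: among all regexes denoting the same language as the input, return one of minimal length.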