LTLBench: Towards Benchmarks for Evaluating Temporal Logic Reasoning in Large Language Models

Temporal Reasoning (TR) is a critical ability that lets LLMs understand and reason over temporal information and the relationships between events. Prior work has proposed various ways to evaluate different aspects of this ability. In this work, we offer an alternative perspective that evaluates TR through Linear Temporal Logic (LTL), and we develop a pipeline that automatically synthesizes challenges for assessing the TR ability of LLMs. Using this pipeline, we construct a dataset, LTLBench, consisting of 2,000 TR challenges, and benchmark 12 LLMs across 5 different methods. We further conduct experiments to investigate how increasing the number of formula operators and events affects both LLM performance and the complexity of TR problems. Finally, we qualitatively analyze the models' reasoning processes and the effects of varying the number of events and formula operators; these analyses reveal 3 main issues in the models' temporal reasoning, as well as unexpected performance changes as problem complexity increases. We expect this work to provide valuable insights into the TR ability of LLMs.
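
To make the notion of an LTL-based challenge concrete, the following is a minimal sketch of how the truth of an LTL formula can be checked against a sequence of events. It is an illustration only and is not the pipeline described above: the tuple-based formula encoding, the `holds` function, and the use of finite-trace semantics (where "next" is false at the last step) are all assumptions made for this example.

```python
from typing import List, Set, Tuple, Union

# Hypothetical encoding: an atom is a string, a compound formula is a tuple
# (operator, operand, ...). A trace lists the events true at each time step.
Formula = Union[str, Tuple]
Trace = List[Set[str]]


def holds(f: Formula, trace: Trace, i: int = 0) -> bool:
    """Return True if formula f holds at position i of a finite trace."""
    if isinstance(f, str):  # atomic event
        return f in trace[i]
    op = f[0]
    if op == "not":
        return not holds(f[1], trace, i)
    if op == "and":
        return holds(f[1], trace, i) and holds(f[2], trace, i)
    if op == "or":
        return holds(f[1], trace, i) or holds(f[2], trace, i)
    if op == "X":  # next: assumed false when there is no next step
        return i + 1 < len(trace) and holds(f[1], trace, i + 1)
    if op == "F":  # eventually: operand holds at some step from i onward
        return any(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "G":  # always: operand holds at every step from i onward
        return all(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "U":  # until: f[2] holds at some step j, f[1] holds before j
        return any(
            holds(f[2], trace, j)
            and all(holds(f[1], trace, k) for k in range(i, j))
            for j in range(i, len(trace))
        )
    raise ValueError(f"unknown operator: {op!r}")


# Event "a" happens at steps 0 and 1, then event "b" at step 2.
trace = [{"a"}, {"a"}, {"b"}]
a_until_b = holds(("U", "a", "b"), trace)
always_a = holds(("G", "a"), trace)
```

In this sketch, a TR challenge amounts to asking whether a formula such as "a until b" holds for a given sequence of events; adding operators or events to the formula enlarges the space of cases a solver must consider.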
