We introduce a novel framework for evaluating the alignment between natural language plans and their expected behavior by converting them into Kripke structures and Linear Temporal Logic (LTL) using Large Language Models (LLMs) and performing model checking. We systematically evaluate this framework on a simplified version of the PlanBench plan verification dataset and report on metrics like Accuracy, Precision, Recall and F1 scores. Our experiments demonstrate that GPT-5 achieves excellent classification performance (F1 score of 96.3%) while almost always producing syntactically perfect formal representations that can act as guarantees. However, the synthesis of semantically perfect formal models remains an area for future exploration.
翻译:我们提出了一种新颖的框架,用于评估自然语言规划与其预期行为之间的一致性。该框架利用大型语言模型(LLMs)将自然语言规划转换为克里普克结构和线性时序逻辑(LTL),并进行模型检验。我们在PlanBench规划验证数据集的简化版本上系统评估了该框架,并报告了准确率、精确率、召回率和F1分数等指标。实验结果表明,GPT-5在分类任务中表现出色(F1分数达96.3%),且几乎总能生成语法完美的形式化表示,可作为行为保证。然而,语义完美的形式化模型合成仍是未来探索的方向。