PACIFIC: A Framework for Generating Benchmarks to Check Precise Automatically Checked Instruction Following In Code

Large Language Model (LLM)-based code assistants have emerged as a powerful application of generative AI, demonstrating impressive capabilities in code generation and comprehension. A key requirement for these systems is their ability to accurately follow user instructions. We present Precise Automatically Checked Instruction Following In Code (PACIFIC), a novel framework designed to automatically generate benchmarks that rigorously assess sequential instruction-following and code dry-running capabilities in LLMs, while allowing control over benchmark difficulty. PACIFIC produces benchmark variants with clearly defined expected outputs, enabling straightforward and reliable evaluation through simple output comparisons. In contrast to existing approaches that often rely on tool usage or agentic behavior, our work isolates and evaluates the LLM's intrinsic ability to reason through code behavior step-by-step without execution (dry running) and to follow instructions. Furthermore, our framework mitigates training data contamination by facilitating effortless generation of novel benchmark variations. We validate our framework by generating a suite of benchmarks spanning a range of difficulty levels and evaluating multiple state-of-the-art LLMs. Our results demonstrate that PACIFIC can produce increasingly challenging benchmarks that effectively differentiate instruction-following and dry-running capabilities, even among advanced models. Overall, our framework offers a scalable, contamination-resilient methodology for assessing core competencies of LLMs in code-related tasks.
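
To make the idea of automatically checked, difficulty-controlled benchmarks concrete, the following is a minimal illustrative sketch and not the PACIFIC implementation. The instruction pool, the function names `generate_sample` and `is_correct`, the starting value, and the prompt wording are all assumptions introduced here for illustration. The sketch samples a seeded chain of instructions, obtains the expected output by actually executing the chain, and scores a model reply by plain string comparison.

```python
import random

# Hypothetical instruction pool (an assumption for illustration): each entry
# pairs a natural-language instruction with the function that computes it.
INSTRUCTIONS = [
    ("Add 3 to the number.", lambda x: x + 3),
    ("Multiply the number by 2.", lambda x: x * 2),
    ("Subtract 5 from the number.", lambda x: x - 5),
    ("Square the number and take the remainder modulo 97.", lambda x: (x * x) % 97),
]


def generate_sample(seed, num_steps, start=7):
    """Build one benchmark sample; num_steps acts as the difficulty knob."""
    rng = random.Random(seed)  # seeded, so each variant is reproducible
    steps = [rng.choice(INSTRUCTIONS) for _ in range(num_steps)]
    value = start
    for _, apply_step in steps:
        value = apply_step(value)  # ground truth comes from real execution
    lines = [f"Start with the number {start}."]
    lines += [f"{i}. {text}" for i, (text, _) in enumerate(steps, 1)]
    lines.append("Reply with the final number only.")
    return {"prompt": "\n".join(lines), "expected": str(value)}


def is_correct(model_output, expected):
    """Score by plain output comparison after trimming whitespace."""
    return model_output.strip() == expected
```

Under these assumptions, changing the seed yields a fresh variant that is unlikely to appear in any training corpus, and increasing `num_steps` lengthens the chain the model must trace without executing anything.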
