Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment: developers also expect adherence to explicit requirements for robustness, formatting, and security. Existing benchmarks primarily assess correctness through test-case execution, offering limited insight into how reliably models follow such constraints. We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories. Constraints are curated through a four-stage human-LLM pipeline to ensure they are atomic, relevant, and objective. We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance. Results reveal a substantial gap between partial and strict satisfaction: while strong models achieve over 90% partial adherence, strict adherence remains between 39% and 66%. These findings highlight that trustworthy code generation requires not only correctness but also consistent adherence to developer intent.
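To make the distinction between the adherence metrics concrete, the sketch below shows one plausible way to compute per-task partial adherence, strict adherence, and a composite correctness-plus-compliance score. The abstract does not specify the C2A Score formula, so the harmonic-mean combination and function names here are illustrative assumptions, not the paper's definition.

```python
# Minimal sketch (assumptions): the abstract does not define the C2A Score
# formula; the harmonic-mean combination below is a hypothetical stand-in.

def partial_adherence(satisfied: int, total: int) -> float:
    """Fraction of a task's constraints that the generated code satisfies."""
    return satisfied / total if total else 1.0

def strict_adherence(satisfied: int, total: int) -> float:
    """1.0 only if every constraint is satisfied, else 0.0."""
    return 1.0 if satisfied == total else 0.0

def c2a_score(correct: bool, satisfied: int, total: int) -> float:
    """Hypothetical composite of test-case correctness and constraint
    compliance (harmonic mean), standing in for the paper's C2A Score."""
    c = 1.0 if correct else 0.0
    a = partial_adherence(satisfied, total)
    return 2 * c * a / (c + a) if (c + a) else 0.0

# Example: a task with 7 constraints, 5 satisfied, that passes its tests.
print(partial_adherence(5, 7))                         # ~0.71
print(strict_adherence(5, 7))                          # 0.0
print(c2a_score(correct=True, satisfied=5, total=7))   # ~0.83
```

The gap between the second and first values illustrates why strict adherence can lag partial adherence so sharply: a single unmet constraint zeroes out the strict metric while barely moving the partial one.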