Beyond Accuracy: Behavioral Dynamics of Agentic Multi-Hunk Repair

Automated program repair has traditionally focused on single-hunk defects, overlooking the multi-hunk bugs that are prevalent in real-world systems. Repairing these bugs requires coordinated edits across multiple, disjoint code regions, which makes the task substantially harder. We present the first systematic study of LLM-driven coding agents (Claude Code, Codex, Gemini-cli, and Qwen Code) on this task. We evaluate the four agents on 372 multi-hunk bugs from the Hunk4J dataset, analyzing 1,488 repair trajectories with fine-grained metrics that capture localization, repair accuracy, regression behavior, and operational dynamics. The results reveal substantial variation: repair accuracy ranges from 25.8% (Qwen Code) to 93.3% (Claude Code) and consistently declines as bug dispersion and complexity increase. High-performing agents demonstrate superior semantic consistency, achieving positive regression reduction, whereas lower-performing agents often introduce new test failures. Notably, agents do not fail fast: failed repairs consume substantially more resources (39%-343% more tokens) and take longer to execute (43%-427% more time) than successful ones. In addition, we develop Maple to provide agents with repository-level context. Empirical results show that Maple improves the repair accuracy of Gemini-cli by 30% through enhanced localization. By combining fine-grained metrics with trajectory-level analysis, this study moves beyond accuracy to explain how coding agents localize, reason, and act during multi-hunk repair.

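The arithmetic behind two of the reported figures can be sketched as follows. The trajectory count follows directly from the abstract (372 bugs, four agents, one trajectory per pair). The overhead calculation is only an illustration: the abstract does not say whether the overhead compares means or medians, so the mean-based definition below is an assumption. The token counts are made-up placeholder values, not data from the study, and the function names are hypothetical.

```python
from statistics import mean


def trajectory_count(num_bugs: int, num_agents: int) -> int:
    """One repair trajectory per (bug, agent) pair."""
    return num_bugs * num_agents


def relative_overhead(failed: list[int], succeeded: list[int]) -> float:
    """Percent by which the mean cost of failed runs exceeds that of
    successful runs. Mean-based comparison is an assumption here."""
    base = mean(succeeded)
    return (mean(failed) - base) * 100.0 / base


# Hypothetical token counts per trajectory (placeholder values only).
succeeded_tokens = [100_000, 120_000, 80_000]
failed_tokens = [150_000, 130_000, 137_000]

print(trajectory_count(372, 4))  # 1488
print(relative_overhead(failed_tokens, succeeded_tokens))  # 39.0
```

With these placeholder values, failed runs cost 39% more tokens on average than successful ones, which corresponds to the lower end of the reported 39%-343% range. The same function would apply to execution time by passing durations instead of token counts.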