Keeping Code-Aware LLMs Fresh: Full Refresh, In-Context Deltas, and Incremental Fine-Tuning

Modern codebases evolve continuously: files are renamed or deleted, public APIs drift, and behavior shifts within otherwise familiar modules. A model trained yesterday to map a developer's natural-language question to the exact set of repository file paths that matter will degrade tomorrow, even if the questions themselves look unchanged. In this paper we study, at system scale and across several widely used repositories, how to keep such a model fresh without sacrificing retention of earlier code. We frame freshness as a form of domain drift between a base snapshot and the current HEAD, and we compare three families of update strategies: (A) Full Refresh, which retrains the entire model at the new snapshot; (B) In-Context Learning (ICL), which injects recent deltas (raw git diffs or concise English summaries) at inference time; and (C) Incremental Fine-Tuning (Inc-FT) on delta-derived training sets, with carefully controlled NEW:OLD mixing to mitigate catastrophic forgetting. We contribute an alias-aware evaluation protocol that credits renames while never rewarding deleted paths, and a practical Forgetting Probe that quantifies residual emissions of obsolete paths. Across Flask, SQLAlchemy, Pandas, and Poetry, Inc-FT with old-aware mixes delivers the best overall balance on mixed sets, ICL with English delta summaries delivers the fastest new-code lift when training is not feasible, and Full Refresh remains the ceiling when maximum NEW accuracy matters. We also compare Git-diff Inc-FT to full-file Inc-FT, showing that diffs excel in rename/delete-heavy windows while full-file context wins in behavior-change-heavy windows.
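To make the alias-aware protocol and the Forgetting Probe concrete, the following is a minimal illustrative sketch, not the scoring code used in this work. It assumes a rename map from old to new paths and a set of deleted paths are available for the snapshot window. The function names `alias_aware_scores` and `forgetting_rate`, the precision/recall formulation, and the choice of which paths count as obsolete are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple


def alias_aware_scores(
    predicted: Iterable[str],
    gold: Iterable[str],
    renames: Dict[str, str],
    deleted: Set[str],
) -> Tuple[float, float]:
    """Return (precision, recall) after resolving renamed paths.

    Assumption: `renames` maps an old path to its current path, and
    `deleted` holds paths that no longer exist at HEAD.
    """
    pred_set = set(predicted)
    # Deleted paths are dropped before matching, so they can never score a hit.
    canon_pred = {renames.get(p, p) for p in pred_set if p not in deleted}
    canon_gold = {renames.get(p, p) for p in set(gold) if p not in deleted}
    hits = canon_pred & canon_gold
    # Precision is taken over every distinct emitted path, so a deleted
    # path still counts against the model.
    precision = len(hits) / len(pred_set) if pred_set else 0.0
    recall = len(hits) / len(canon_gold) if canon_gold else 0.0
    return precision, recall


def forgetting_rate(predicted: Iterable[str], obsolete: Set[str]) -> float:
    """Fraction of distinct emitted paths that are obsolete.

    Assumption: the caller decides what counts as obsolete, for example
    deleted paths plus pre-rename names.
    """
    pred_set = set(predicted)
    return len(pred_set & obsolete) / len(pred_set) if pred_set else 0.0


# Hypothetical example paths, for illustration only.
predicted = ["src/old_utils.py", "src/removed.py", "src/app.py"]
gold = ["src/utils.py", "src/app.py"]
renames = {"src/old_utils.py": "src/utils.py"}
deleted = {"src/removed.py"}

precision, recall = alias_aware_scores(predicted, gold, renames, deleted)
probe = forgetting_rate(predicted, deleted | set(renames))
```

In this toy case the renamed path earns credit through its alias, the deleted path earns none, and the probe still flags both stale emissions, which is why the two measurements are reported separately.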
