While Large Language Models (LLMs) excel at code generation, their inherent tendency toward verbatim memorization of training data introduces critical risks such as copyright infringement, insecure code generation, and deprecated API usage. A straightforward yet promising defense is unlearning, i.e., erasing or down-weighting the offending snippets through post-training. However, we find that applying it to source code often spills over, damaging the LLM's basic knowledge of programming languages and degrading its overall capability. To address this challenge, we propose PROD for precise source code unlearning. PROD surgically zeroes out the prediction probability of the prohibited tokens and renormalizes the remaining distribution so that the generated code stays correct. By excising only the targeted snippets, PROD achieves precise forgetting without significantly degrading the LLM's overall capability. To enable in-depth evaluation of PROD, we establish an unlearning benchmark consisting of three downstream tasks (i.e., unlearning of copyrighted code, insecure code, and deprecated APIs), and introduce the Pareto Dominance Ratio (PDR) metric, which jointly captures forget quality and model utility. Our comprehensive evaluation demonstrates that PROD achieves a superior trade-off between forget quality and model utility compared to existing unlearning approaches across the three downstream tasks, while consistently exhibiting improvements when applied to LLMs from different model families. PROD also exhibits superior robustness against adversarial attacks without generating or exposing the data to be forgotten. These results underscore that our approach not only successfully extends the application boundary of unlearning techniques to source code, but also holds significant implications for advancing reliable code generation.
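The core operation described above, zeroing the probability of prohibited tokens and renormalizing the remaining distribution, can be illustrated with a minimal sketch. The function name `mask_and_renormalize` and the assumption that prohibited tokens are supplied as vocabulary ids are illustrative only; they are not drawn from PROD's actual implementation or training objective.

```python
import torch


def mask_and_renormalize(logits: torch.Tensor, prohibited_ids: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: zero out prohibited tokens and renormalize.

    logits: raw next-token logits of shape [vocab_size].
    prohibited_ids: 1-D tensor of token ids to be forgotten (hypothetical input format).
    Returns a probability distribution over the vocabulary with zero mass on prohibited tokens.
    """
    probs = torch.softmax(logits, dim=-1)
    probs[prohibited_ids] = 0.0          # excise the targeted tokens
    probs = probs / probs.sum()          # renormalize the remaining probability mass
    return probs


# Usage example with a toy vocabulary of 10 tokens, forbidding tokens 3 and 7.
logits = torch.randn(10)
adjusted = mask_and_renormalize(logits, torch.tensor([3, 7]))
assert torch.isclose(adjusted.sum(), torch.tensor(1.0))
```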