LLM Unlearning with LLM Beliefs

Large language models trained on vast corpora inherently risk memorizing sensitive or harmful content, which may later resurface in their outputs. Prevailing unlearning methods generally rely on gradient ascent and its variants to lower the probability of specific target responses. However, we find that this strategy induces a critical side effect: probability mass is redistributed into high-likelihood regions, often corresponding to semantically related rephrasings of the targets. We refer to this as the squeezing effect, which explains why many methods yield merely spurious unlearning, a problem further obscured by automated metrics (e.g., ROUGE, truth ratio) that misreport actual success. To address this, we propose a bootstrapping (BS) framework that explicitly links the squeezing effect with the model's own high-confidence generations, namely its model beliefs. Since model beliefs inherently capture the very high-likelihood regions where probability mass is squeezed, incorporating them into the unlearning objective directly counters the squeezing effect. By jointly suppressing both target responses and model beliefs, BS-T (token) attenuates high-probability tokens, whereas BS-S (sequence) removes entire high-confidence generations, together achieving more thorough forgetting while preserving utility. Extensive experiments across diverse benchmarks with various model families confirm the effectiveness of our approach.
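
The squeezing effect described above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's BS-T or BS-S objective: it treats five whole responses as outcomes of a single softmax, and every name in it (`suppress_step`, `unlearn`, `top_belief`, the logits, the step count, the learning rate) is an assumption made for illustration. Outcome 0 stands for the target response, outcome 1 for a likely rephrasing of it, and outcomes 2 to 4 for unrelated responses. The loss, the log of the total probability of a "forget set", is also an illustrative choice; with a one-element set it reduces to plain gradient ascent on the target's negative log-likelihood.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def suppress_step(logits, forget_set, lr):
    # One gradient-descent step on log P(forget_set).
    # With forget_set == {target} this is gradient ascent on the target's NLL.
    probs = softmax(logits)
    set_prob = sum(probs[j] for j in forget_set)
    if set_prob == 0.0:
        return list(logits)  # nothing left to suppress
    new_logits = []
    for j, z in enumerate(logits):
        # d log P(S) / d z_j = 1[j in S] * p_j / P(S) - p_j
        grad = (probs[j] / set_prob if j in forget_set else 0.0) - probs[j]
        new_logits.append(z - lr * grad)
    return new_logits

def unlearn(logits, forget_set, steps=100, lr=0.5):
    # Repeat the suppression step and return the final distribution.
    for _ in range(steps):
        logits = suppress_step(logits, forget_set, lr)
    return softmax(logits)

def top_belief(logits, exclude):
    # Toy stand-in for a "model belief": the most likely outcome
    # other than the ones already in the forget set (hypothetical helper).
    probs = softmax(logits)
    candidates = [j for j in range(len(probs)) if j not in exclude]
    return max(candidates, key=lambda j: probs[j])

TARGET = 0
# Assumed toy logits: target, likely rephrasing, three unrelated outcomes.
init_logits = [3.0, 2.0, 0.0, 0.0, 0.0]

# Gradient ascent on the target alone.
p_ga = unlearn(init_logits, {TARGET})

# Suppress the target together with the model's own top alternative.
belief = top_belief(init_logits, {TARGET})
p_bs = unlearn(init_logits, {TARGET, belief})

print("GA only:        ", [round(p, 3) for p in p_ga])
print("target + belief:", [round(p, 3) for p in p_bs])
```

In this toy setting, each step that lowers the target's logit raises every other logit in proportion to its current probability, so suppressing the target alone moves most of the mass onto outcome 1, the already-likely rephrasing. Adding the model's top alternative to the forget set pushes the mass onto the unrelated outcomes instead. A real language model distributes probability over token sequences rather than five fixed outcomes, so this only conveys the intuition.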

