Large language models trained on vast corpora inherently risk memorizing sensitive or harmful content, which may later resurface in their outputs. Prevailing unlearning methods generally rely on gradient ascent and its variants to lower the probability of specific target responses. However, we find that this strategy induces a critical side effect: probability mass is redistributed into high-likelihood regions, often corresponding to semantically related rephrasings of the targets. We refer to this as the squeezing effect, which explains why many methods yield merely spurious unlearning, a problem further obscured by automated metrics (e.g., ROUGE, truth ratio) that misreport actual success. To address this, we propose a bootstrapping (BS) framework that explicitly links the squeezing effect with the model's own high-confidence generations, namely its model beliefs. Since model beliefs inherently capture the very high-likelihood regions where probability mass is squeezed, incorporating them into the unlearning objective directly counters the squeezing effect. By jointly suppressing both target responses and model beliefs, BS-T (token) attenuates high-probability tokens, whereas BS-S (sequence) removes entire high-confidence generations, together achieving more thorough forgetting while preserving utility. Extensive experiments across diverse benchmarks with various model families confirm the effectiveness of our approach.