Machine unlearning aims to remove the influence of specific data from trained models, a capability essential for complying with copyright law and ensuring AI safety. Current unlearning metrics typically measure success by monitoring the model's performance degradation on the specific unlearning dataset ($D_u$). We argue that for Large Language Models (LLMs), this evaluation paradigm is insufficient and potentially misleading. Many real-world uses of unlearning--motivated by copyright or safety--implicitly target not only the verbatim content in $D_u$, but also behaviors shaped by the broader generalizations the model derived from it. We demonstrate that LLMs can pass standard unlearning evaluations and appear to have "forgotten" the target knowledge, while simultaneously retaining strong capabilities on content that is semantically adjacent to $D_u$. This phenomenon indicates that erasing exact sentences does not necessarily equate to removing the underlying knowledge. To address this gap, we propose Proximal Surrogate Generation (PSG), an automated stress-testing framework that generates a surrogate dataset, $\tilde{D}_u$. This surrogate set is constructed to be semantically derived from $D_u$ yet sufficiently distinct from it in embedding space. By comparing unlearning metric scores between $D_u$ and $\tilde{D}_u$, we can stress-test the reliability of the metric itself. Our extensive evaluation across three LLM families (Llama-3-8B, Qwen2.5-7B, and Zephyr-7B-$\beta$), three distinct datasets, and seven standard metrics reveals widespread inconsistencies. We find that current metrics frequently overestimate unlearning success, failing to detect retained knowledge exposed by our stress-test datasets.
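To make the stress-test idea concrete, below is a minimal sketch of a PSG-style comparison, not the authors' implementation. It assumes a user-supplied `paraphrase_fn` that produces semantically related rewrites, a `metric_fn` that scores the unlearned model on a single example (e.g., ROUGE or negative log-likelihood), an off-the-shelf sentence embedder (`all-MiniLM-L6-v2`), and illustrative similarity thresholds; all of these names and values are assumptions for illustration only.

```python
# Hedged sketch of a PSG-style stress test; `paraphrase_fn`, `metric_fn`,
# and the similarity band (0.6-0.9) are hypothetical choices, not the paper's.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_surrogate(d_u, paraphrase_fn, sim_low=0.6, sim_high=0.9):
    """Keep rewrites that stay semantically close to the original
    example yet are sufficiently distinct in embedding space."""
    surrogates = []
    for text in d_u:
        src_emb = embedder.encode(text, convert_to_tensor=True)
        for cand in paraphrase_fn(text):
            cand_emb = embedder.encode(cand, convert_to_tensor=True)
            sim = util.cos_sim(src_emb, cand_emb).item()
            if sim_low <= sim <= sim_high:  # related, but not a verbatim copy
                surrogates.append(cand)
    return surrogates

def stress_test(metric_fn, d_u, d_u_tilde):
    """Compare the unlearning metric on D_u versus the surrogate set."""
    score_du = sum(map(metric_fn, d_u)) / len(d_u)
    score_surr = sum(map(metric_fn, d_u_tilde)) / len(d_u_tilde)
    # A large gap suggests the metric reports "forgotten" on D_u while
    # knowledge remains accessible through semantically adjacent prompts.
    return score_du, score_surr, score_surr - score_du
```

The similarity band is the key design knob in this sketch: the lower bound keeps surrogates semantically tied to $D_u$, while the upper bound excludes near-verbatim copies that the standard evaluation would already cover.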