The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation

Machine unlearning aims to remove the influence of specific data from trained models, a capability essential for complying with copyright law and ensuring AI safety. Current unlearning metrics typically measure success by monitoring the model's performance degradation on the specific unlearning dataset ($D_u$). We argue that for Large Language Models (LLMs), this evaluation paradigm is insufficient and potentially misleading. Many real-world uses of unlearning, whether motivated by copyright or safety, implicitly target not only the verbatim content in $D_u$ but also behaviors shaped by the broader generalizations the model derived from it. We demonstrate that LLMs can pass standard unlearning evaluations and appear to have "forgotten" the target knowledge while retaining strong capabilities on content that is semantically adjacent to $D_u$. This phenomenon indicates that erasing exact sentences does not necessarily remove the underlying knowledge. To address this gap, we propose Proximal Surrogate Generation (PSG), an automated stress-testing framework that generates a surrogate dataset, $\tilde{D}_u$. This surrogate set is constructed to be semantically derived from $D_u$ yet sufficiently distinct from it in embedding space. By comparing unlearning metric scores on $D_u$ and $\tilde{D}_u$, we can stress-test the reliability of the metric itself. Our extensive evaluation across three LLM families (Llama-3-8B, Qwen2.5-7B, and Zephyr-7B-$\beta$), three distinct datasets, and seven standard metrics reveals widespread inconsistencies. We find that current metrics frequently overestimate unlearning success, failing to detect the retained knowledge exposed by our stress-test datasets.
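The comparison between $D_u$ and $\tilde{D}_u$ can be illustrated with a minimal sketch. It is not the PSG implementation: the abstract does not specify how surrogates are generated or how "sufficiently distinct" is measured, so the cosine-similarity band (`min_sim`, `max_sim`), the function names, and the toy embeddings below are all assumptions made for illustration. The sketch keeps candidates that are related to a source item without being near-duplicates, then reports how much a per-example score differs between the two sets.

```python
import math
from statistics import mean


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_surrogates(source_vec, candidates, min_sim=0.3, max_sim=0.8):
    """Keep candidate texts whose embedding is related to the source
    (similarity >= min_sim) but not a near-duplicate (similarity <= max_sim).

    `candidates` is a list of (text, embedding) pairs. The thresholds are
    hypothetical placeholders, not values from the paper.
    """
    return [
        text
        for text, vec in candidates
        if min_sim <= cosine_similarity(source_vec, vec) <= max_sim
    ]


def metric_gap(scores_du, scores_surrogate):
    """Mean score on the surrogate set minus mean score on D_u.

    With a score where higher means more retained knowledge, a large
    positive gap suggests the metric on D_u alone overstates forgetting.
    """
    return mean(scores_surrogate) - mean(scores_du)


# Toy 2-D embeddings standing in for real sentence embeddings.
source = [1.0, 0.0]
candidates = [
    ("near-duplicate paraphrase", [1.0, 0.01]),
    ("semantically adjacent item", [1.0, 1.0]),
    ("unrelated item", [0.0, 1.0]),
]
kept = select_surrogates(source, candidates)
gap = metric_gap([0.1, 0.2], [0.6, 0.8])
```

In this toy run only the adjacent item falls inside the similarity band, and the positive gap mirrors the situation the abstract describes: the model scores low on $D_u$ but remains capable on the surrogate set.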
