With the growing adoption of data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the source data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal without access to the original training data samples. Our approach utilizes a surrogate dataset that approximates the statistical properties of the source data, allowing the noise added during unlearning to be scaled according to the statistical distance between the two. While our theoretical guarantees assume knowledge of the exact statistical distance, practical implementations typically approximate this distance, resulting in potentially weaker but still meaningful privacy guarantees. This calibration provides strong guarantees on the model's behavior after unlearning while preserving its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.
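As a rough illustration of how such a distance-aware noise calibration might look in practice, the sketch below combines a standard (epsilon, delta) Gaussian-mechanism noise scale with an inflation factor driven by an estimated surrogate-to-source statistical distance, applied after a Newton-style removal step computed on the surrogate data. The function names, the `(1 + stat_distance)` inflation factor, and the Newton-step formulation are illustrative assumptions for exposition, not the paper's actual algorithm or bounds.

```python
import numpy as np


def calibrate_noise_scale(sensitivity, epsilon, delta, stat_distance):
    """Gaussian-mechanism-style noise calibration.

    The base scale is the standard (epsilon, delta) Gaussian mechanism;
    the (1 + stat_distance) factor is an illustrative assumption for how an
    estimated surrogate-to-source statistical distance could inflate the noise.
    """
    base = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return base * (1.0 + stat_distance)


def unlearn_with_surrogate(theta, grad_removed, hessian_surrogate,
                           epsilon, delta, stat_distance, sensitivity):
    """One Newton-style removal step using surrogate-data curvature,
    followed by Gaussian noise at the calibrated scale.

    grad_removed is taken to be the gradient of the removed points' loss at
    theta; the sign convention depends on the loss formulation.
    """
    # Approximate the influence of the removed points via the surrogate Hessian.
    update = np.linalg.solve(hessian_surrogate, grad_removed)
    sigma = calibrate_noise_scale(sensitivity, epsilon, delta, stat_distance)
    noise = np.random.normal(0.0, sigma, size=theta.shape)
    return theta + update + noise
```

Here `stat_distance` stands for an estimate of the statistical distance between the surrogate dataset and the inaccessible source data; a larger estimate inflates the injected noise, trading utility for a more conservative guarantee, in the spirit of the approximation discussed above.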


