We study the problem of unlearning datapoints from a learnt model. The learner first receives a dataset $S$ drawn i.i.d. from an unknown distribution, and outputs a model $\widehat{w}$ that performs well on unseen samples from the same distribution. However, at some point in the future, any training datapoint $z \in S$ can request to be unlearned, thus prompting the learner to modify its output model while still ensuring the same accuracy guarantees. We initiate a rigorous study of generalization in machine unlearning, where the goal is to perform well on previously unseen datapoints. Our focus is on both computational and storage complexity. For the setting of convex losses, we provide an unlearning algorithm that can unlearn up to $O(n/d^{1/4})$ samples, where $d$ is the problem dimension. In comparison, in general, differentially private learning (which implies unlearning) only guarantees deletion of $O(n/d^{1/2})$ samples. This demonstrates a novel separation between differential privacy and machine unlearning.
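To make the claimed separation concrete, here is a rough comparison of the two deletion capacities, suppressing constants and logarithmic factors and assuming $d > 1$ (the ratio itself is not stated in the abstract; it follows directly from the two bounds quoted above):
\[
\frac{\;n/d^{1/4}\;}{\;n/d^{1/2}\;} \;=\; d^{1/4} \;\gg\; 1 \quad \text{for large } d,
\]
so the proposed unlearning algorithm can accommodate roughly a factor of $d^{1/4}$ more deletion requests than the generic differentially-private baseline, which is the sense in which the two notions are separated.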