Since the recent advent of data protection regulations (e.g., the General Data Protection Regulation), there has been increasing demand for deleting information learned from sensitive data from pre-trained models without retraining from scratch. The inherent vulnerability of neural networks to adversarial attacks and unfairness also calls for a robust method to remove or correct information in an instance-wise fashion, while retaining predictive performance on the remaining data. To this end, we define instance-wise unlearning, whose goal is to delete information on a set of instances from a pre-trained model, by either misclassifying each instance away from its original prediction or relabeling it with a different label. We also propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation level, and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information. Both methods require only the pre-trained model and the data instances to forget, allowing painless application to real-life settings where the entire training set is unavailable. Through extensive experiments on various image classification benchmarks, we show that our approach effectively preserves knowledge of the remaining data while unlearning the given instances, in both single-task and continual unlearning scenarios.
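To make the first idea concrete, below is a minimal PyTorch sketch of adversarial regularization during unlearning, assuming a classifier `model` that returns logits. The single-step FGSM attack, the random-relabeling rule, and all names and hyperparameters (`eps`, `lam`, `num_classes`) are illustrative assumptions, not the paper's exact algorithm.

```python
# Hedged sketch: unlearn forget instances while anchoring the model on
# adversarial neighbors, used here as proxies for nearby decision regions.
import torch
import torch.nn.functional as F

def fgsm_examples(model, x, y, eps=0.03):
    """Single-step FGSM: perturb x along the loss gradient so the model
    assigns the perturbed points to *other* classes near the boundary."""
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def unlearn_step(model, optimizer, forget_x, forget_y, num_classes, lam=1.0):
    """One gradient step: relabel the forget set to different classes while
    keeping predictions stable on the adversarial anchors."""
    model.eval()  # freeze batch-norm statistics while crafting examples
    adv_x = fgsm_examples(model, forget_x, forget_y)
    with torch.no_grad():
        adv_y = model(adv_x).argmax(dim=1)  # adversarial labels to preserve

    model.train()
    optimizer.zero_grad()
    # Relabel each forget instance to a random *different* class...
    wrong_y = (forget_y + torch.randint(1, num_classes, forget_y.shape,
                                        device=forget_y.device)) % num_classes
    forget_loss = F.cross_entropy(model(forget_x), wrong_y)
    # ...while retaining the model's behavior on the adversarial anchors.
    retain_loss = F.cross_entropy(model(adv_x), adv_y)
    (forget_loss + lam * retain_loss).backward()
    optimizer.step()
```

Note that this step uses only the pre-trained model and the forget instances, matching the setting where the remaining training data is unavailable.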
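The second idea can be sketched in the same hedged spirit: estimate a per-parameter importance score from gradients on the forget set, and confine unlearning updates to the highest-scoring weights. The diagonal-Fisher-style score and the top-k gradient masking below are one plausible instantiation, not the paper's exact importance metric.

```python
# Hedged sketch: localize unlearning to parameters most implicated in
# predictions on the forget set. All names are illustrative assumptions.
import torch
import torch.nn.functional as F

def importance_scores(model, forget_x, forget_y):
    """Per-parameter importance: squared loss gradients on the forget
    instances (a diagonal-Fisher-style estimate)."""
    model.zero_grad()
    F.cross_entropy(model(forget_x), forget_y).backward()
    return {n: p.grad.detach() ** 2
            for n, p in model.named_parameters() if p.grad is not None}

def mask_gradients(model, scores, keep_ratio=0.1):
    """Zero the current gradients on all but the top `keep_ratio` fraction
    of parameters by importance, so only weights implicated in the forget
    set are updated."""
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(flat.numel() * keep_ratio))
    threshold = flat.topk(k).values.min()
    for n, p in model.named_parameters():
        if p.grad is not None and n in scores:
            p.grad.mul_((scores[n] >= threshold).to(p.grad.dtype))
```

In practice one would compute `scores` once from the forget set, then call `mask_gradients` between `backward()` and `optimizer.step()` in a step like `unlearn_step` above, so that parameters unrelated to the forget instances stay untouched.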