To what extent neural networks memorize their training data is an intriguing question with practical and theoretical implications. In this paper we show that, in some cases, a significant fraction of the training data can be reconstructed from the parameters of a trained neural network classifier. We propose a novel reconstruction scheme that stems from recent theoretical results on the implicit bias of training neural networks with gradient-based methods. To the best of our knowledge, our results are the first to show that reconstructing a large portion of the actual training samples from a trained neural network classifier is generally possible. This has negative implications for privacy, since the scheme can be used as an attack that reveals sensitive training data. We demonstrate our method on binary MLP classifiers trained on a few standard computer vision datasets.
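To make the link between implicit bias and reconstruction concrete, the following is a rough sketch of how such a scheme could be set up. It is not taken from the abstract: it assumes a homogeneous network $\Phi(\theta; x)$ trained with a logistic-type loss, for which known implicit-bias results say that gradient flow converges in direction to a KKT point of the max-margin problem. The symbols $\Phi$, $\theta$, $\lambda_i$, $n$ and $m$ are illustrative notation, not the authors' own.

```latex
% Assumed stationarity condition at a KKT point of the max-margin problem,
% for training samples (x_i, y_i) with labels y_i in {-1, +1}:
\theta \;=\; \sum_{i=1}^{n} \lambda_i \, y_i \, \nabla_\theta \Phi(\theta; x_i),
\qquad \lambda_i \ge 0 .

% A reconstruction objective built on that condition: with the trained
% parameters \theta held fixed, search over m candidate samples x_i
% (with labels y_i fixed in advance) and coefficients \lambda_i >= 0
% that minimize the stationarity residual.
L(x_1, \dots, x_m, \lambda_1, \dots, \lambda_m)
\;=\; \Bigl\| \, \theta - \sum_{i=1}^{m} \lambda_i \, y_i \, \nabla_\theta \Phi(\theta; x_i) \Bigr\|_2^2 .
```

Under these assumptions, candidates $x_i$ that drive the residual toward zero are the natural guesses for training samples lying on the margin, which would be one way to read the claim that training data is recoverable from the parameters alone.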