A surprising phenomenon in modern machine learning is the ability of a highly overparameterized model to generalize well (small error on the test data) even when it is trained to memorize the training data (zero error on the training data). This has led to an arms race towards increasingly overparameterized models (cf. deep learning). In this paper, we study an underexplored hidden cost of overparameterization: the fact that overparameterized models are more vulnerable to privacy attacks, in particular the membership inference attack, which predicts whether a (potentially sensitive) example was used to train a model. We significantly extend the relatively few empirical results on this problem by theoretically proving that, for an overparameterized linear regression model with Gaussian data, membership inference vulnerability increases with the number of parameters. Moreover, a range of empirical studies indicates that more complex, nonlinear models exhibit the same behavior. Finally, we study different methods for mitigating such attacks in the overparameterized regime, such as noise addition and regularization, and conclude that simply reducing the number of parameters of an overparameterized model is an effective strategy to protect it from membership inference without greatly decreasing its generalization error.
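As a rough, runnable illustration of the setting described above (a sketch, not code from the paper), the following simulation trains a minimum-norm linear regressor on synthetic Gaussian data and mounts a simple loss-threshold membership inference attack, reporting how the attack's advantage behaves as the number of parameters grows. The function name `run_trial`, the sample sizes, the noise level, and the sweep over `d` are all hypothetical choices made for illustration.

```python
# Illustrative sketch: loss-threshold membership inference against a
# minimum-norm interpolating linear regressor trained on Gaussian data.
# All constants and helper names here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def run_trial(n_train=50, d=200, noise=0.5, n_test=50):
    """Train min-norm linear regression on Gaussian data with d features,
    then score a loss-threshold membership inference attack."""
    w_star = rng.standard_normal(d) / np.sqrt(d)      # ground-truth weights
    X_tr = rng.standard_normal((n_train, d))          # Gaussian training features
    y_tr = X_tr @ w_star + noise * rng.standard_normal(n_train)
    X_te = rng.standard_normal((n_test, d))           # held-out (non-member) data
    y_te = X_te @ w_star + noise * rng.standard_normal(n_test)

    w_hat = np.linalg.pinv(X_tr) @ y_tr               # minimum-norm least-squares solution

    loss_members = (X_tr @ w_hat - y_tr) ** 2         # ~0 once d > n_train (interpolation)
    loss_nonmembers = (X_te @ w_hat - y_te) ** 2

    # Attack: predict "member" when the per-example loss falls below a threshold.
    threshold = np.median(np.concatenate([loss_members, loss_nonmembers]))
    tpr = np.mean(loss_members < threshold)
    fpr = np.mean(loss_nonmembers < threshold)
    return tpr - fpr                                  # membership advantage

for d in [10, 50, 100, 500, 2000]:
    adv = np.mean([run_trial(d=d) for _ in range(20)])
    print(f"d = {d:5d}  mean membership advantage = {adv:.2f}")
```

Under these assumptions, once the model interpolates the training set its per-example training losses collapse to zero while held-out losses do not, so a simple threshold separates members from non-members more easily; this is only a toy proxy for the paper's formal analysis.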