A surprising phenomenon in modern machine learning is the ability of a highly overparameterized model to generalize well (small error on the test data) even when it is trained to memorize the training data (zero error on the training data). This has led to an arms race towards increasingly overparameterized models (cf. deep learning). In this paper, we study an underexplored hidden cost of overparameterization: the fact that overparameterized models may be more vulnerable to privacy attacks, in particular the membership inference attack, which predicts whether a (potentially sensitive) example was used to train the model. We significantly extend the relatively few empirical results on this problem by theoretically proving, for an overparameterized linear regression model in the Gaussian data setting, that membership inference vulnerability increases with the number of parameters. Moreover, a range of empirical studies indicates that more complex, nonlinear models exhibit the same behavior. Finally, we extend our analysis to ridge-regularized linear regression and show in the Gaussian data setting that increased regularization also increases membership inference vulnerability in the overparameterized regime.
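To make the setting concrete, the following is a minimal NumPy sketch of the core phenomenon, not the paper's construction or attack: it fits a minimum-norm least-squares interpolator on Gaussian data and runs a simple loss-threshold membership inference attack, measuring how well per-example losses separate training members from non-members as the parameter count d grows. The sample sizes, noise level, and the loss-threshold attack itself are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def min_norm_fit(X, y):
    # Minimum-norm least-squares solution; for d > n this interpolates
    # the training data exactly (zero training error).
    return np.linalg.pinv(X) @ y

def membership_auc(d, n=50, n_test=50, sigma=0.5):
    # Gaussian features with a linear ground truth plus label noise.
    w_star = rng.standard_normal(d) / np.sqrt(d)
    X_tr = rng.standard_normal((n, d))
    y_tr = X_tr @ w_star + sigma * rng.standard_normal(n)
    X_te = rng.standard_normal((n_test, d))
    y_te = X_te @ w_star + sigma * rng.standard_normal(n_test)

    w_hat = min_norm_fit(X_tr, y_tr)
    loss_tr = (X_tr @ w_hat - y_tr) ** 2   # members' losses
    loss_te = (X_te @ w_hat - y_te) ** 2   # non-members' losses

    # Loss-threshold attack: lower loss => predict "member".
    # Report the AUC of separating the two loss distributions,
    # computed via the Mann-Whitney U rank statistic.
    labels = np.concatenate([np.ones(n), np.zeros(n_test)])
    scores = -np.concatenate([loss_tr, loss_te])
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    u = ranks[labels == 1].sum() - n * (n + 1) / 2
    return u / (n * n_test)

for d in [10, 50, 100, 500, 2000]:
    aucs = [membership_auc(d) for _ in range(20)]
    print(f"d={d:5d}  membership-inference AUC ~ {np.mean(aucs):.3f}")
```

In the overparameterized regime (d > n = 50 here), training losses collapse to zero while test losses do not, so the attack's AUC climbs well above the 0.5 chance level as d increases, mirroring the vulnerability trend the abstract describes.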