Data used to train machine learning (ML) models can be sensitive. Membership inference attacks (MIAs), which attempt to determine whether a particular data record was used to train an ML model, risk violating membership privacy. ML model builders need a principled metric that quantifies privacy risk (a) for individual training data records, (b) independently of specific MIAs, and (c) efficiently. None of the prior work on membership privacy risk metrics meets all of these criteria simultaneously. We propose such a metric, SHAPr, which uses Shapley values to quantify a model's memorization of an individual training data record by measuring its influence on the model's utility. This memorization is a measure of the likelihood of a successful MIA. Using ten benchmark datasets, we show that SHAPr is effective (precision: $0.94 \pm 0.06$, recall: $0.88 \pm 0.06$) in estimating the susceptibility of a training data record to MIAs, and is efficient (computable within minutes for smaller datasets and in ~90 minutes for the largest dataset). SHAPr is also versatile in that it can be used for other purposes, such as assessing fairness or assigning a valuation to subsets of a dataset. For example, we show that SHAPr correctly captures the disproportionate vulnerability of different subgroups to MIAs. Using SHAPr, we show that the membership privacy risk of a dataset is not necessarily improved by removing high-risk training data records, thereby confirming an observation from prior work in a significantly extended setting (on ten datasets, removing up to 50% of the data).
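To make the idea of scoring a training record by its influence on model utility concrete, the following is a minimal sketch of an exact Shapley value computation on a toy problem. It is not SHAPr's estimator, which the abstract does not describe. The 1-nearest-neighbour utility, the zero utility assigned to an empty training set, the one-dimensional toy data, and the names `utility` and `exact_shapley` are all assumptions made for illustration. Brute-force enumeration of permutations is only feasible for a handful of records.

```python
from itertools import permutations


def utility(train, test):
    """Accuracy on `test` of a 1-nearest-neighbour classifier built from `train`.

    Records are (feature, label) pairs with a single numeric feature.
    Assumption: an empty training set has utility 0.0.
    """
    if not train:
        return 0.0
    correct = 0
    for x, y in test:
        # Predict the label of the closest training record.
        _, pred = min(train, key=lambda rec: abs(rec[0] - x))
        correct += int(pred == y)
    return correct / len(test)


def exact_shapley(train, test):
    """Average each record's marginal gain in utility over all orderings."""
    n = len(train)
    scores = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        subset = []
        prev = utility(subset, test)
        for i in order:
            subset.append(train[i])
            cur = utility(subset, test)
            scores[i] += cur - prev  # marginal contribution of record i
            prev = cur
    return [s / len(orderings) for s in scores]


# Hypothetical toy data: four training records and four test records.
train = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)]
test = [(0.4, 0), (1.2, 0), (4.3, 1), (5.2, 1)]

scores = exact_shapley(train, test)
for record, score in zip(train, scores):
    print(record, round(score, 4))
```

By the efficiency property of Shapley values, the per-record scores sum to the utility of the full training set minus that of the empty set, so each score can be read as that record's share of the model's accuracy. The abstract's claim is that records with a larger influence of this kind are the ones more likely to be exposed by an MIA.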