Membership inference attacks are one of the simplest forms of privacy leakage for machine learning models: given a data point and a model, determine whether the point was used to train the model. Existing membership inference attacks exploit models' abnormal confidence when queried on their training data. These attacks do not apply if the adversary only has access to models' predicted labels, without a confidence measure. In this paper, we introduce label-only membership inference attacks. Instead of relying on confidence scores, our attacks evaluate the robustness of a model's predicted labels under perturbations to obtain a fine-grained membership signal. These perturbations include common data augmentations and adversarial examples. We empirically show that our label-only membership inference attacks perform on par with prior attacks that require access to model confidences. We further demonstrate that label-only attacks break multiple defenses against membership inference that (implicitly or explicitly) rely on a phenomenon we call confidence masking. These defenses modify a model's confidence scores in order to thwart attacks, but leave the model's predicted labels unchanged. Our label-only attacks demonstrate that confidence masking is not a viable defense strategy against membership inference. Finally, we investigate worst-case label-only attacks that infer membership for a small number of outlier data points. We show that label-only attacks also match confidence-based attacks in this setting. We find that training with differential privacy and (strong) L2 regularization are the only known defense strategies that successfully prevent all attacks. This remains true even when the differential privacy budget is too high to offer meaningful provable guarantees.
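To make the core idea concrete, here is a minimal sketch of the augmentation-based variant of a label-only membership signal. It assumes a hypothetical `predict_labels` callable that returns hard labels only (no confidences) for a batch of HxWxC images; the specific perturbations, counts, and threshold below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def label_only_membership_score(predict_labels, x, y, n_augmentations=16, seed=0):
    """Fraction of simple perturbations of x on which the model still
    predicts the true label y, queried via hard labels only.

    Training points tend to be classified more robustly under such
    perturbations, so a higher score suggests membership.
    """
    rng = np.random.default_rng(seed)
    variants = [x]
    for _ in range(n_augmentations - 1):
        v = x.copy()
        if rng.random() < 0.5:                 # random horizontal flip
            v = v[:, ::-1, :]
        dy, dx = rng.integers(-2, 3, size=2)   # small random translation
        v = np.roll(v, shift=(int(dy), int(dx)), axis=(0, 1))
        variants.append(v)
    preds = predict_labels(np.stack(variants))  # hard labels only, no scores
    return float(np.mean(preds == y))           # label robustness as signal

# Membership is then decided by thresholding the score, with the threshold
# calibrated on shadow models or held-out data, e.g.:
#   is_member = label_only_membership_score(model_predict, x, y) >= 0.9
```

The adversarial-example variant follows the same template but replaces random augmentations with an estimate of the distance from x to the model's decision boundary, computed with label-only (decision-based) attacks; larger distances again indicate membership.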