The Synthetic Minority Over-sampling Technique (SMOTE) is one of the most widely used methods for addressing class imbalance and generating synthetic data. Despite its popularity, little attention has been paid to its privacy implications; yet, it is used in the wild in many privacy-sensitive applications. In this work, we conduct the first systematic study of privacy leakage in SMOTE: We begin by showing that prevailing evaluation practices, i.e., naive distinguishing and distance-to-closest-record metrics, completely fail to detect any leakage, and that membership inference attacks (MIAs) can be instantiated with high accuracy. Then, by exploiting SMOTE's geometric properties, we build two novel attacks under very limited assumptions: DistinSMOTE, which perfectly distinguishes real from synthetic records in augmented datasets, and ReconSMOTE, which reconstructs real minority records from synthetic datasets with perfect precision and with recall approaching one under realistic imbalance ratios. We also provide theoretical guarantees for both attacks. Experiments on eight standard imbalanced datasets confirm the practicality and effectiveness of these attacks. Overall, our work reveals that SMOTE is inherently non-private and disproportionately exposes minority records, highlighting the need to reconsider its use in privacy-sensitive applications.
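To make the geometric property underlying the attacks concrete, the following is a minimal sketch of standard SMOTE interpolation (not the paper's attack code): every synthetic record lies on the line segment between a real minority record and one of its k nearest minority-class neighbors. The function name and parameters are illustrative, not from the paper.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, rng=None):
    """Generate synthetic minority samples via SMOTE-style interpolation.

    Each synthetic point is a convex combination of a real minority record
    and one of its k nearest minority neighbors -- the geometric structure
    that distinguishing and reconstruction attacks can exploit.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Brute-force k-nearest-neighbor search within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # exclude each point itself
    nn = np.argsort(dists, axis=1)[:, :k]      # indices of k nearest neighbors
    synth = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)                    # pick a real minority record
        b = nn[a, rng.integers(min(k, n - 1))] # pick one of its neighbors
        lam = rng.random()                     # interpolation weight in [0, 1)
        synth[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return synth
```

Because each output is constrained to a segment between two real records, synthetic points cluster on a low-dimensional skeleton of the real data, which is why naive distance-based privacy metrics can miss the leakage.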