Ensuring the privacy of training data is a growing concern, since many machine learning models are trained on confidential and potentially sensitive data. Much attention has been devoted to methods for protecting individual privacy during analyses of large datasets. However, in many settings, global properties of the dataset may also be sensitive (e.g., the mortality rate in a hospital rather than the presence of a particular patient in the dataset). In this work, we depart from individual privacy to initiate the study of attribute privacy, where a data owner is concerned about revealing sensitive properties of the whole dataset during analysis. We propose definitions that capture \emph{attribute privacy} in two relevant cases where global attributes may need to be protected: (1) properties of a specific dataset and (2) parameters of the underlying distribution from which the dataset is sampled. We also provide two efficient mechanisms and one inefficient mechanism that satisfy attribute privacy in these settings. We base our results on a novel use of the Pufferfish framework to account for correlations across attributes in the data, thus addressing ``the challenging problem of developing Pufferfish instantiations and algorithms for general aggregate secrets'' that was left open by \cite{kifer2014pufferfish}.