In many machine learning applications, the training data can contain highly sensitive personal information. Training large-scale deep models that are guaranteed not to leak sensitive information, while not compromising their accuracy, has been a significant challenge. In this work, we study the multi-class classification setting where the labels are considered sensitive and ought to be protected. We propose a new algorithm for training deep neural networks with label differential privacy, and evaluate it on several datasets. For Fashion MNIST and CIFAR-10, we demonstrate that our algorithm achieves significantly higher accuracy than the state-of-the-art, and in some regimes comes close to the non-private baselines. We also provide non-trivial training results for the challenging CIFAR-100 dataset. We complement our algorithm with theoretical findings showing that, in the setting of convex empirical risk minimization, the sample complexity of training with label differential privacy is dimension-independent, in contrast to vanilla differential privacy.
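To make the notion of label differential privacy concrete, here is a minimal sketch of the classical multi-class randomized-response mechanism, which randomizes only the label of each example. This is an illustrative baseline, not the algorithm proposed above; the function and parameter names are ours.

```python
import math
import random

def randomized_response(label: int, num_classes: int, epsilon: float) -> int:
    """Report the true label with probability e^eps / (e^eps + K - 1),
    otherwise a uniformly random *other* label.

    The output satisfies eps-label differential privacy: for any two labels,
    the ratio of output probabilities is at most e^eps.
    """
    p_true = math.exp(epsilon) / (math.exp(epsilon) + num_classes - 1)
    if random.random() < p_true:
        return label
    # Pick uniformly among the other K-1 labels, skipping the true one.
    other = random.randrange(num_classes - 1)
    return other if other < label else other + 1
```

A model can then be trained on these randomized labels; stronger privacy (smaller epsilon) means noisier labels and a harder learning problem, which is the accuracy/privacy trade-off the abstract refers to.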