Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and attempts at straightforwardly applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained models; (2) hyperparameters that suit DP optimization; and (3) fine-tuning objectives aligned with the pretraining procedure. With these factors set right, we obtain private NLP models that outperform state-of-the-art private training approaches and strong non-private baselines -- by directly fine-tuning pretrained models with DP optimization on moderately-sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory-saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training at a modest run-time overhead. Contrary to conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension), empirical results reveal that private learning with pretrained models tends not to suffer from dimension-dependent performance degradation.
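To make the clipping step concrete, below is a minimal PyTorch sketch of the kind of identity such a memory-saving approach can exploit, shown for a single linear layer; the helper names per_example_grad_sq_norm_linear and noisy_clipped_mean are illustrative, not taken from the paper. The point is that the per-example gradient norm for a linear layer's weight can be computed from the activations saved in the forward pass and the output gradients from the backward pass, so the per-example gradients themselves are never materialized.

```python
import torch

def per_example_grad_sq_norm_linear(a, g):
    """Squared per-example gradient norms for a linear layer's weight.

    a: (B, T, d) inputs to the layer, saved during the forward pass
    g: (B, T, p) gradients w.r.t. the layer's outputs, from the backward pass

    The per-example weight gradient would be G_i = g_i^T a_i (shape p x d);
    its squared Frobenius norm equals <a_i a_i^T, g_i g_i^T>, so it can be
    computed from (B, T, T) buffers without ever forming G_i.
    """
    aaT = torch.bmm(a, a.transpose(1, 2))  # (B, T, T)
    ggT = torch.bmm(g, g.transpose(1, 2))  # (B, T, T)
    return (aaT * ggT).sum(dim=(1, 2))     # (B,) squared norms

def noisy_clipped_mean(summed_clipped_grad, batch_size, clip_norm, noise_multiplier):
    """Standard DP-SGD release: Gaussian noise scaled to the clipping norm,
    then an average over the batch."""
    noise = noise_multiplier * clip_norm * torch.randn_like(summed_clipped_grad)
    return (summed_clipped_grad + noise) / batch_size
```

A full implementation would sum these squared norms across layers, form clipping coefficients c_i = min(1, C / ||grad_i||), and obtain the summed clipped gradient (for example, via a second backward pass over the loss reweighted by c_i) before applying the noisy averaging step above.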