Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and straightforward attempts at applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained language models; (2) non-standard hyperparameters that suit DP optimization; and (3) fine-tuning objectives that are aligned with the pretraining procedure. With the above, we obtain NLP models that outperform state-of-the-art DP-trained models under the same privacy budget and strong non-private baselines -- by directly fine-tuning pretrained models with DP optimization on moderately-sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory-saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any linear layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training, at a modest run-time overhead. Contrary to conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension), empirical results reveal that private learning with pretrained language models does not tend to suffer from dimension-dependent performance degradation. Code to reproduce results can be found at https://github.com/lxuechen/private-transformers.
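To illustrate the memory-saving idea for a single linear layer, here is a minimal PyTorch sketch (not the private-transformers API; tensor shapes and names are illustrative assumptions): the squared per-example gradient norm of the layer's weight can be computed from the layer's inputs and output gradients alone, so the (batch, p, d) per-example gradient tensor is never materialized.

```python
import torch

def per_example_grad_sq_norms(inputs: torch.Tensor, grad_outputs: torch.Tensor) -> torch.Tensor:
    """Squared per-example gradient norms of a linear layer's weight.

    inputs:       (B, T, d) activations fed into the linear layer.
    grad_outputs: (B, T, p) gradients of the loss w.r.t. the layer's outputs.

    The per-example weight gradient is G_i = grad_outputs_i^T @ inputs_i, and
    ||G_i||_F^2 = <inputs_i inputs_i^T, grad_outputs_i grad_outputs_i^T>, so only
    two (T, T) Gram matrices per example are needed, not a (p, d) gradient.
    """
    a_gram = torch.bmm(inputs, inputs.transpose(1, 2))              # (B, T, T)
    g_gram = torch.bmm(grad_outputs, grad_outputs.transpose(1, 2))  # (B, T, T)
    return (a_gram * g_gram).sum(dim=(1, 2))                        # (B,)

# Illustrative usage with random tensors standing in for real activations/gradients.
B, T, d, p = 4, 128, 768, 768
a = torch.randn(B, T, d)
g = torch.randn(B, T, p)
sq_norms = per_example_grad_sq_norms(a, g)

# Per-example clipping factors that DP-SGD would apply before aggregating gradients.
clip_norm = 1.0
factors = (clip_norm / (sq_norms.sqrt() + 1e-6)).clamp(max=1.0)
```

This sketch pays O(B T^2) memory per layer instead of O(B p d), which is the favorable trade-off whenever the sequence length is small relative to the layer's weight dimensions.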