Large Language Models Can Be Strong Differentially Private Learners

Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and attempts at straightforwardly applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained models; (2) hyperparameters that suit DP optimization; and (3) fine-tuning objectives aligned with the pretraining procedure. With these factors set right, we obtain private NLP models that outperform state-of-the-art private training approaches and strong non-private baselines, by directly fine-tuning pretrained models with DP optimization on moderately sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory-saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training at a modest run-time overhead. Contrary to the conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension), empirical results reveal that private learning with pretrained models tends to not suffer from dimension-dependent performance degradation.
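
To make the clipping step concrete, the following is a minimal NumPy sketch of how per-example gradient norms for a single linear layer can be obtained without materializing per-example gradients. It is an illustration under stated assumptions, not the authors' implementation: the function names (`per_example_grad_norms`, `clipped_noisy_grad`), the tensor shapes, and the single-layer setting are all hypothetical choices made here. It relies on the identity that, for layer inputs `a_i` (shape T x d_in) and output gradients `g_i` (shape T x d_out), the per-example weight gradient `g_i^T a_i` has squared Frobenius norm equal to the sum of the elementwise product of `a_i a_i^T` and `g_i g_i^T`.

```python
import numpy as np


def per_example_grad_norms(a, g):
    """Per-example weight-gradient norms of a linear layer.

    a: layer inputs, shape (B, T, d_in)
    g: gradients of the loss w.r.t. the layer outputs, shape (B, T, d_out)

    Only (B, T, T) Gram matrices are formed; the (B, d_out, d_in)
    per-example gradients are never instantiated.
    """
    aa = np.einsum("bti,bsi->bts", a, a)  # a_i a_i^T for each example
    gg = np.einsum("bto,bso->bts", g, g)  # g_i g_i^T for each example
    sq_norms = np.sum(aa * gg, axis=(1, 2))
    # Guard against tiny negative values caused by floating-point error.
    return np.sqrt(np.maximum(sq_norms, 0.0))


def clipped_noisy_grad(a, g, clip_norm, noise_multiplier, rng):
    """One DP-SGD-style gradient estimate for the layer's weight matrix.

    Each example's gradient is rescaled to have norm at most clip_norm,
    the rescaled gradients are summed, Gaussian noise with standard
    deviation noise_multiplier * clip_norm is added, and the result is
    averaged over the batch.
    """
    norms = per_example_grad_norms(a, g)
    scale = np.minimum(1.0, clip_norm / (norms + 1e-12))  # shape (B,)
    # sum_i scale_i * g_i^T a_i, computed as a single contraction.
    summed = np.einsum("b,bto,bti->oi", scale, g, a)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    return (summed + noise) / a.shape[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(4, 5, 3))  # batch of 4, sequence length 5, d_in = 3
    g = rng.normal(size=(4, 5, 2))  # d_out = 2
    print(per_example_grad_norms(a, g))
    print(clipped_noisy_grad(a, g, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
```

In a full model the clipping factor would depend on the gradient norm across all parameters, so the squared per-layer norms would need to be summed over layers before taking the square root; the sketch omits this and treats one layer in isolation. The Gram-matrix route saves memory when T x T is small relative to d_out x d_in, which is an assumption about the layer and sequence sizes rather than something guaranteed in general.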
