Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of \emph{group-wise clipping}. To reduce the compute time overhead of private learning, we show that \emph{per-layer clipping}, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175 billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with \emph{per-device clipping} that clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at $\epsilon=1$ better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.
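To make the distinction between flat clipping and group-wise (per-layer) clipping concrete, below is a minimal NumPy sketch, not the authors' implementation: it assumes per-sample gradients are already materialized as one array per layer, and the function names (`flat_clip`, `per_layer_clip`) and the per-layer thresholds `Cs` are hypothetical illustrations. The key point is that per-layer clipping needs only one layer's per-sample gradient at a time, so in practice it can run during backpropagation and free that gradient immediately, whereas flat clipping needs the per-example norm over all layers before any scaling can happen.

```python
import numpy as np

def flat_clip(per_sample_grads, C):
    """Flat clipping: scale each example's *full* gradient so its global
    l2 norm (over all layers) is at most C.
    per_sample_grads: list over layers of arrays of shape (batch, ...)."""
    B = per_sample_grads[0].shape[0]
    # Squared per-example norm accumulated over every layer.
    sq = sum((g.reshape(B, -1) ** 2).sum(axis=1) for g in per_sample_grads)
    scale = np.minimum(1.0, C / np.sqrt(sq + 1e-12))  # shape (B,)
    return [g * scale.reshape(-1, *([1] * (g.ndim - 1))) for g in per_sample_grads]

def per_layer_clip(per_sample_grads, Cs):
    """Per-layer clipping: clip each layer's per-example gradient to its own
    threshold Cs[k]. Each layer is handled independently, so this step can be
    fused with that layer's backward pass."""
    clipped = []
    for g, Ck in zip(per_sample_grads, Cs):
        B = g.shape[0]
        norms = np.sqrt((g.reshape(B, -1) ** 2).sum(axis=1) + 1e-12)
        scale = np.minimum(1.0, Ck / norms)  # shape (B,)
        clipped.append(g * scale.reshape(-1, *([1] * (g.ndim - 1))))
    return clipped
```

In a DP-SGD step, the clipped per-sample gradients would then be summed over the batch and Gaussian noise calibrated to the clipping threshold(s) added before the parameter update; that noising step is unchanged between the two clipping schemes.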