In deep learning with differential privacy (DP), the neural network usually achieves privacy at the cost of slower convergence (and thus lower performance) than its non-private counterpart. This work gives the first convergence analysis of DP deep learning through the lens of training dynamics and the neural tangent kernel (NTK). Our convergence theory successfully characterizes the effects of the two key components in DP training: per-sample clipping and noise addition. Our analysis not only establishes a general, principled framework for understanding DP deep learning with any network architecture and loss function, but also motivates a new clipping method, the global clipping, which significantly improves convergence while preserving the same DP guarantee and computational efficiency as the existing method, which we term local clipping. On the theoretical side, we precisely characterize the effect of per-sample clipping on the NTK matrix and show that the noise level of DP optimizers does not affect the convergence in the gradient flow regime. In particular, local clipping almost certainly breaks the positive semi-definiteness of the NTK, whereas our global clipping preserves it. Consequently, DP gradient descent (GD) with global clipping converges monotonically to zero loss, a property often violated by the existing DP-GD. Notably, our analysis framework easily extends to other optimizers, e.g., DP-Adam. We demonstrate through numerous experiments that DP optimizers equipped with global clipping perform strongly on classification and regression tasks. In addition, our global clipping is surprisingly effective at learning calibrated classifiers, in contrast to existing DP classifiers, which are oftentimes over-confident and unreliable. Implementation-wise, the new clipping can be realized by inserting one line of code into the PyTorch Opacus library.
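To make the distinction concrete, below is a minimal sketch contrasting the standard per-sample ("local") clipping used in DP-SGD with a global clipping rule of the kind described above. The `global_clip` form shown here (keep per-sample gradients whose norm is below the threshold, zero out the rest) is an assumed illustration consistent with the abstract, not necessarily the exact rule used in the paper; the noise scale and aggregation follow the usual DP-SGD recipe.

```python
# Sketch: local (per-sample) clipping vs. an assumed global clipping rule.
# Per-sample gradients are assumed to have shape (batch, num_params).
import torch


def local_clip(per_sample_grads: torch.Tensor, clip_norm: float) -> torch.Tensor:
    """Standard DP-SGD clipping: rescale each gradient so its norm is <= clip_norm."""
    norms = per_sample_grads.flatten(1).norm(dim=1)          # ||g_i|| for each sample
    scale = (clip_norm / (norms + 1e-12)).clamp(max=1.0)     # min(1, C / ||g_i||)
    return per_sample_grads * scale.view(-1, 1)


def global_clip(per_sample_grads: torch.Tensor, clip_norm: float) -> torch.Tensor:
    """Assumed global clipping: keep gradients with norm <= clip_norm unchanged,
    discard (zero out) the rest, so surviving gradients are never rescaled."""
    norms = per_sample_grads.flatten(1).norm(dim=1)
    keep = (norms <= clip_norm).float()                      # indicator, not a rescaling
    return per_sample_grads * keep.view(-1, 1)


def dp_gradient(per_sample_grads, clip_norm, noise_multiplier, clip_fn):
    """One noisy gradient step: clip per sample, sum, add Gaussian noise, average.
    Both clipping rules bound each sample's contribution by clip_norm, so the
    same noise scale yields the same DP guarantee."""
    clipped = clip_fn(per_sample_grads, clip_norm)
    summed = clipped.sum(dim=0)
    noise = noise_multiplier * clip_norm * torch.randn_like(summed)
    return (summed + noise) / per_sample_grads.shape[0]


# Toy usage on random per-sample gradients.
g = torch.randn(8, 10)
update_local = dp_gradient(g, clip_norm=1.0, noise_multiplier=1.0, clip_fn=local_clip)
update_global = dp_gradient(g, clip_norm=1.0, noise_multiplier=1.0, clip_fn=global_clip)
```

The key design difference illustrated here is that local clipping rescales every per-sample gradient by a sample-dependent factor, distorting the relative magnitudes that enter the NTK analysis, while the global rule applies only an all-or-nothing indicator per sample.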