We study differentially private (DP) stochastic optimization (SO) with data containing outliers and loss functions that are (possibly) not Lipschitz continuous. To date, the vast majority of work on DP SO assumes that the loss is uniformly Lipschitz over data (i.e., stochastic gradients are uniformly bounded over all data points). While this assumption is convenient, it is often unrealistic: in many practical problems, the loss function may not be uniformly Lipschitz. Even when the loss function is Lipschitz continuous, the worst-case Lipschitz parameter of the loss over all data points may be extremely large due to outliers. In such cases, the error bounds for DP SO, which scale with the worst-case Lipschitz parameter of the loss, are vacuous. To address these limitations, this work does not require the loss function to be uniformly Lipschitz. Instead, building on a recent line of work [WXDX20, KLZ22], we make the weaker assumption that stochastic gradients have bounded $k$-th order moments for some $k \geq 2$. Compared with works on DP Lipschitz SO, our excess risk scales with the $k$-th moment bound instead of the Lipschitz parameter of the loss, allowing for significantly faster rates in the presence of outliers. For convex and strongly convex loss functions, we provide the first asymptotically optimal excess risk bounds (up to a logarithmic factor). In contrast to prior works, our bounds do not require the loss function to be differentiable or smooth. We also devise an accelerated algorithm for smooth losses that runs in linear time and has excess risk that is tight in certain practical parameter regimes. Additionally, our work is the first to address non-convex, non-Lipschitz loss functions satisfying the Proximal-PL inequality; this covers some practical machine learning models. Our Proximal-PL algorithm has near-optimal excess risk.
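For concreteness, the bounded-moment assumption can be written as follows (a minimal formalization in our own notation; the symbols $f$, $w$, $x$, $\mathcal{W}$, $\mathcal{D}$, and $\gamma$ are illustrative and not taken verbatim from the paper):
$$
\sup_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim \mathcal{D}} \left[ \|\nabla f(w, x)\|_2^{\,k} \right] \leq \gamma^{k} \quad \text{for some } k \geq 2,
$$
in place of the uniform Lipschitz requirement $\sup_{w \in \mathcal{W},\, x} \|\nabla f(w, x)\|_2 \leq L$. The excess risk bounds described above then scale with the moment bound $\gamma$ rather than the worst-case Lipschitz parameter $L$, which can be much larger in the presence of outliers.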