We study the generalization properties of unregularized gradient methods applied to separable linear classification -- a setting that has received considerable attention since the pioneering work of Soudry et al. (2018). We establish tight upper and lower (population) risk bounds for gradient descent in this setting, for any smooth loss function, expressed in terms of its tail decay rate. Our bounds take the form $\Theta\big(r_{\ell,T}^2/(\gamma^2 T) + r_{\ell,T}^2/(\gamma^2 n)\big)$, where $T$ is the number of gradient steps, $n$ is the size of the training set, $\gamma$ is the data margin, and $r_{\ell,T}$ is a complexity term that depends on the tail decay rate of the loss function (and on $T$). Our upper bound matches the best known upper bounds due to Shamir (2021); Schliserman and Koren (2022), while extending their applicability to virtually any smooth loss function and relaxing the technical assumptions they impose. Our risk lower bounds are the first in this context and establish the tightness of our upper bounds for any given tail decay rate and in all parameter regimes. The proof technique used to show these results is also markedly simpler compared to previous work, and is straightforward to extend to other gradient methods; we illustrate this by providing analogous results for Stochastic Gradient Descent.
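As an illustrative instantiation (under the assumption, consistent with the exponentially-tailed case analyzed by Shamir (2021), that the complexity term satisfies $r_{\ell,T} = \Theta(\log T)$ for the logistic loss), the bound specializes to
$$
\Theta\!\left(\frac{\log^2 T}{\gamma^2 T} + \frac{\log^2 T}{\gamma^2 n}\right),
$$
so that running gradient descent for $T \approx n$ steps yields population risk of order $\log^2 n / (\gamma^2 n)$.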