Methods for learning from data depend on various types of tuning parameters, such as penalization strength or step size. Since performance can depend strongly on these parameters, it is important to compare classes of estimators, each defined by a prescribed finite set of tuning parameters, rather than only particularly tuned methods. In this work, we investigate classes of methods via the relative performance of the best method in the class. We consider the central problem of linear regression with a random isotropic ground truth, and investigate the estimation performance of two fundamental methods, gradient descent and ridge regression. We unveil the following phenomena. (1) For general designs, constant step size gradient descent outperforms ridge regression when the eigenvalues of the empirical data covariance matrix decay slowly, as a power law with exponent less than unity. If instead the eigenvalues decay quickly, as a power law with exponent greater than unity or exponentially, we show that ridge regression outperforms gradient descent. (2) For orthogonal designs, we compute the exact minimax optimal class of estimators (achieving min-max-min optimality), showing it is equivalent to gradient descent with a decaying learning rate. We also establish the sub-optimality of ridge regression and of gradient descent with constant step size. Our results highlight that statistical performance can depend strongly on tuning parameters. In particular, while optimally tuned ridge regression is the best estimator in our setting, it can be outperformed by gradient descent by an arbitrarily large (unbounded) amount when both methods are tuned only over finitely many regularization parameters.
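The comparison of finite classes described above can be illustrated with a minimal sketch. It is not the paper's analysis, and it assumes the standard spectral-filter view of both methods: in the eigenbasis of the empirical covariance, ridge regression shrinks each least-squares coordinate by s/(s + lam), and gradient descent started from zero with constant step size eta shrinks it by 1 - (1 - eta*s)^t after t steps. The eigenvalue profile, the grids, and all names (`ridge_filter`, `gd_filter`, `risk`, `psi2`, `sigma2`) are hypothetical choices made only for illustration.

```python
import numpy as np

def ridge_filter(s, lam):
    # Shrinkage applied by ridge regression to each eigen-direction.
    return s / (s + lam)

def gd_filter(s, eta, t):
    # Shrinkage after t steps of constant step size gradient descent from zero.
    return 1.0 - (1.0 - eta * s) ** t

def risk(g, s, n, sigma2, psi2):
    # Expected squared estimation error of a spectral filter g, assuming an
    # isotropic ground truth with E||beta||^2 = psi2 and noise variance sigma2.
    d = s.size
    bias = (1.0 - g) ** 2 * psi2 / d
    variance = g ** 2 * sigma2 / (n * s)
    return float(np.sum(bias + variance))

# Assumed setup: power-law eigenvalues s_i = i^(-alpha) with alpha < 1.
d, n, sigma2, psi2, alpha = 50, 200, 1.0, 1.0, 0.5
s = np.arange(1, d + 1, dtype=float) ** (-alpha)
eta = 1.0 / s.max()  # keeps 0 <= 1 - eta * s < 1, so the iteration is stable

# Prescribed finite tuning grids (illustrative only).
lams = [10.0 ** (-k) for k in range(5)]
steps = [2 ** k for k in range(5)]

best_ridge = min(risk(ridge_filter(s, lam), s, n, sigma2, psi2) for lam in lams)
best_gd = min(risk(gd_filter(s, eta, t), s, n, sigma2, psi2) for t in steps)

# Optimally tuned ridge under these assumptions: lam* = sigma2 * d / (n * psi2).
lam_star = sigma2 * d / (n * psi2)
oracle_ridge = risk(ridge_filter(s, lam_star), s, n, sigma2, psi2)

print(f"best ridge on grid: {best_ridge:.4f}")
print(f"best GD on grid:    {best_gd:.4f}")
print(f"oracle ridge:       {oracle_ridge:.4f}")
```

Which class wins on a given grid depends on the eigenvalue decay and on the grids themselves, so the printed numbers should be read as one instance of the comparison rather than as evidence for the general claims. Under the stated assumptions, the oracle ridge risk is a lower bound for every filter in either class.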