This work studies the behavior of shallow ReLU networks trained with the logistic loss via gradient descent on binary classification data where the underlying data distribution is general, and the (optimal) Bayes risk is not necessarily zero. In this setting, it is shown that gradient descent with early stopping achieves a population risk arbitrarily close to optimal, not just in terms of the logistic and misclassification losses, but also in terms of calibration, meaning the sigmoid mapping of its outputs approximates the true underlying conditional distribution arbitrarily finely. Moreover, the iteration, sample, and architectural complexities required by this analysis all scale naturally with a certain complexity measure of the true conditional model. Lastly, while it is not shown that early stopping is necessary, it is shown that any univariate classifier satisfying a local interpolation property is inconsistent.
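The setting above can be illustrated with a minimal sketch (not the paper's construction or proof): a one-hidden-layer ReLU network trained on noisy binary labels with the logistic loss by full-batch gradient descent, stopped early at the iterate with the best held-out logistic loss, after which the sigmoid of the network output is read as an estimate of the conditional probability η(x) = P(y = 1 | x). All specifics here, including the synthetic η, the width, step size, and iteration budget, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic 1-d conditional model: eta(x) stays in (0, 1), so labels are
# noisy and the Bayes risk is nonzero, matching the setting in the abstract.
def eta(x):
    return 0.5 + 0.4 * np.sin(3 * x)

n, width = 400, 64
X = rng.uniform(-1, 1, size=(n, 1))
y = (rng.uniform(size=n) < eta(X[:, 0])).astype(float)   # labels in {0, 1}

# Shallow ReLU net f(x) = a^T relu(W x + b), small random initialization.
W = rng.normal(scale=1.0, size=(width, 1))
b = rng.normal(scale=0.1, size=width)
a = rng.normal(scale=1.0 / np.sqrt(width), size=width)

def forward(X, W, b, a):
    H = np.maximum(X @ W.T + b, 0.0)   # hidden ReLU activations
    return H @ a, H                    # network output f(x), activations

def logistic_loss(f, y):
    # mean log(1 + exp(-s f)) with signed labels s = 2y - 1 in {-1, +1}
    s = 2 * y - 1
    return np.mean(np.logaddexp(0.0, -s * f))

# Held-out data used only to pick the early-stopping iterate.
Xval = rng.uniform(-1, 1, size=(200, 1))
yval = (rng.uniform(size=200) < eta(Xval[:, 0])).astype(float)

lr, best_val, best_params = 0.5, np.inf, None
for t in range(2000):
    f, H = forward(X, W, b, a)
    g = (sigmoid(f) - y) / n           # d(mean logistic loss)/df
    grad_a = H.T @ g
    gH = np.outer(g, a) * (H > 0)      # backprop through the ReLU
    a -= lr * grad_a
    W -= lr * (gH.T @ X)
    b -= lr * gH.sum(axis=0)
    fv, _ = forward(Xval, W, b, a)
    v = logistic_loss(fv, yval)
    if v < best_val:                   # early stopping: keep the best iterate
        best_val, best_params = v, (W.copy(), b.copy(), a.copy())

W, b, a = best_params
# Calibration check: sigmoid(f(x)) should track eta(x) on a grid.
grid = np.linspace(-0.9, 0.9, 50)[:, None]
fg, _ = forward(grid, W, b, a)
cal_err = np.mean(np.abs(sigmoid(fg) - eta(grid[:, 0])))
print(float(best_val), float(cal_err))
```

On this toy problem, the early-stopped iterate attains a held-out logistic loss below that of the trivial constant predictor, and the mean gap between sigmoid(f(x)) and η(x) is small, illustrating the calibration claim rather than proving it.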