Deep learning has achieved many breakthroughs in modern classification tasks. Numerous architectures have been proposed for different data structures, but when it comes to the loss function, cross-entropy remains the predominant choice. Recently, several alternative losses have seen revived interest for deep classifiers. In particular, empirical evidence seems to favor the square loss, but a theoretical justification is still lacking. In this work, we contribute to the theoretical understanding of the square loss in classification by systematically investigating how it performs for overparametrized neural networks in the neural tangent kernel (NTK) regime. Interesting properties regarding the generalization error, robustness, and calibration error are revealed. We consider two cases, according to whether the classes are separable or not. In the general non-separable case, fast convergence rates are established for both the misclassification rate and the calibration error. When the classes are separable, the misclassification rate improves to an exponentially fast rate. Furthermore, the resulting margin is proven to be bounded away from zero, providing a theoretical guarantee of robustness. We expect our findings to hold beyond the NTK regime and to translate to practical settings. To this end, we conduct extensive empirical studies on practical neural networks, demonstrating the effectiveness of the square loss on both synthetic low-dimensional data and real image data. Compared to cross-entropy, the square loss has comparable generalization error but noticeable advantages in robustness and model calibration.
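As a concrete illustration of the comparison described above, the following minimal sketch trains the same wide two-layer ReLU network once with the square loss on one-hot targets and once with cross-entropy, and reports training accuracy. This is not the paper's experimental code; the data, architecture, and hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic low-dimensional data: two Gaussian blobs (binary classification).
n, d, num_classes = 512, 2, 2
y = (torch.arange(n) % 2).long()
X = torch.randn(n, d) + (2.0 * y.float() - 1.0).unsqueeze(1) * 2.0

def make_net():
    # A wide two-layer ReLU network, loosely in the spirit of the NTK regime.
    return nn.Sequential(nn.Linear(d, 1024), nn.ReLU(), nn.Linear(1024, num_classes))

def train(loss_name, epochs=200, lr=0.1):
    net = make_net()
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    y_onehot = F.one_hot(y, num_classes).float()
    for _ in range(epochs):
        opt.zero_grad()
        logits = net(X)
        if loss_name == "square":
            # Square loss: regress the network outputs onto one-hot labels.
            loss = F.mse_loss(logits, y_onehot)
        else:
            # Cross-entropy on the same outputs.
            loss = F.cross_entropy(logits, y)
        loss.backward()
        opt.step()
    acc = (net(X).argmax(dim=1) == y).float().mean().item()
    return acc

print("square loss train accuracy:  ", train("square"))
print("cross-entropy train accuracy:", train("cross_entropy"))

The sketch only compares accuracy; the robustness and calibration comparisons discussed in the abstract would additionally require measuring margins (or adversarial perturbations) and a calibration metric such as expected calibration error, which are omitted here.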