Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is widely believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources. Indeed, we observe that the square loss produces better results in the dominant majority of NLP and ASR experiments. Cross-entropy appears to have a slight edge on computer vision tasks. We argue that there is little compelling empirical or theoretical evidence indicating a clear-cut advantage to the cross-entropy loss. Indeed, in our experiments, performance on nearly all non-vision tasks can be improved, sometimes significantly, by switching to the square loss. Furthermore, training with square loss appears to be less sensitive to the randomness in initialization. We posit that training using the square loss for classification needs to be a part of best practices of modern deep learning on equal footing with cross-entropy.
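To make the comparison concrete, the following is a minimal sketch (not the authors' exact formulation, which may rescale the loss for large numbers of classes) of training a classifier with the square loss on one-hot targets instead of cross-entropy. Names such as `model`, `loader`, and `num_classes` are illustrative placeholders.

```python
# Illustrative sketch: square loss for classification in PyTorch.
# Assumes `model` outputs one raw score per class (no softmax applied).
import torch
import torch.nn.functional as F

def square_loss(logits, labels, num_classes):
    # One-hot encode the integer class labels and apply squared error
    # directly to the network outputs, summed over classes, averaged over the batch.
    targets = F.one_hot(labels, num_classes).float()
    return ((logits - targets) ** 2).sum(dim=1).mean()

def train_one_epoch(model, loader, optimizer, num_classes):
    model.train()
    for inputs, labels in loader:
        optimizer.zero_grad()
        logits = model(inputs)          # raw class scores, no softmax
        loss = square_loss(logits, labels, num_classes)
        loss.backward()
        optimizer.step()
```

In this setup the only change relative to a standard cross-entropy pipeline is the loss function itself; the architecture, optimizer, and hyper-parameters are kept as reported in the literature, which is the comparison protocol described above.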