When engineers train deep learning models, they are very much "flying blind". Commonly used approaches for real-time training diagnostics, such as monitoring the train/test loss, are limited. Assessing a network's training process solely through these performance indicators is akin to debugging software without access to internal states through a debugger. To address this, we present Cockpit, a collection of instruments that enable a closer look into the inner workings of a learning machine, and a more informative and meaningful status report for practitioners. It facilitates the identification of learning phases and failure modes, like ill-chosen hyperparameters. These instruments leverage novel higher-order information about the gradient distribution and curvature, which has only recently become efficiently accessible. We believe that such a debugging tool, which we open-source for PyTorch, represents an important step to improve troubleshooting the training process, reveal new insights, and help develop novel methods and heuristics.
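As a rough illustration of the kind of signal such instruments expose, the sketch below (plain NumPy, not the Cockpit API) computes per-sample gradients for a linear least-squares problem and summarizes their distribution: the mini-batch gradient norm plus a simple "gradient noise" measure (the trace of the per-sample gradient covariance). All names here are illustrative; Cockpit itself obtains these quantities efficiently during backpropagation.

```python
import numpy as np

# Illustrative sketch (NOT the Cockpit API): per-sample gradients for
# linear least-squares, summarized beyond the scalar loss value.
rng = np.random.default_rng(0)
n, d = 32, 4
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)                                 # current parameters
residual = X @ w - y                            # shape (n,)
# Gradient of (x_i @ w - y_i)**2 w.r.t. w, one row per sample:
per_sample_grads = 2.0 * residual[:, None] * X  # shape (n, d)

mean_grad = per_sample_grads.mean(axis=0)       # the mini-batch gradient
grad_norm = np.linalg.norm(mean_grad)
# Trace of the per-sample gradient covariance: a crude "gradient noise"
# diagnostic of the kind Cockpit-style instruments track over training.
noise = per_sample_grads.var(axis=0).sum()

print(f"batch gradient norm: {grad_norm:.3f}")
print(f"gradient noise (trace of covariance): {noise:.3f}")
```

Tracking how such distributional quantities evolve over iterations, rather than only the loss curve, is what lets a practitioner distinguish, for example, a too-large learning rate from noisy gradients.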