When engineers train deep learning models, they are largely 'flying blind'. Commonly used methods for real-time training diagnostics, such as monitoring the train/test loss, are limited. Assessing a network's training process solely through these performance indicators is akin to debugging software without access to internal states through a debugger. To address this, we present Cockpit, a collection of instruments that enable a closer look into the inner workings of a learning machine, and a more informative and meaningful status report for practitioners. It facilitates the identification of learning phases and failure modes, such as ill-chosen hyperparameters. These instruments leverage novel higher-order information about the gradient distribution and curvature, which has only recently become efficiently accessible. We believe that such a debugging tool, which we open-source for PyTorch, is a valuable aid in troubleshooting the training process. By revealing new insights, it also contributes more generally to the explainability and interpretability of deep nets.
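To give a flavor of the kind of per-step diagnostics alluded to above, the sketch below logs a simple quantity (the global gradient norm) alongside the loss during training, in plain PyTorch. This is a minimal illustration of the idea, not Cockpit's actual API; the model, data, and hyperparameters are hypothetical stand-ins, and Cockpit itself tracks far richer quantities (e.g., per-sample gradient statistics and curvature estimates) through dedicated instruments.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup standing in for a real training run.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
X, y = torch.randn(256, 10), torch.randn(256, 1)

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()

    # Diagnostics beyond the raw loss value: the global gradient norm can
    # expose exploding/vanishing gradients that a flat loss curve hides.
    grad_norm = torch.sqrt(
        sum((p.grad ** 2).sum() for p in model.parameters())
    ).item()
    if step % 20 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}  |grad| {grad_norm:.4f}")

    opt.step()
```

A dashboard of several such signals, observed together over time, is what lets a practitioner distinguish healthy learning phases from failure modes like a badly tuned learning rate.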