In modern machine learning, model training is an iterative, experimental process that can consume enormous computational resources and developer time. To aid in that process, experienced model developers log and visualize program variables during training runs. Exhaustive logging of all variables is infeasible. Optimistic logging can be accompanied by program checkpoints; this allows developers to add log statements post hoc and "replay" the desired log statements from a checkpoint, a process we refer to as hindsight logging. Unfortunately, hindsight logging raises tricky problems in data management and software engineering. Done poorly, hindsight logging can waste resources and generate technical debt embodied in multiple variants of training code. In this paper, we present methodologies for efficient and effective logging practices for model training, with a focus on techniques for hindsight logging. Our goal is for experienced model developers to learn and adopt these practices. To make this easier, we provide an open-source suite of tools for Fast Low-Overhead Recovery (flor) that embodies our design across three tasks: (i) efficient background logging in Python, (ii) adaptable periodic checkpointing, and (iii) an instrumentation library that codifies hindsight logging for efficient and automatic record-replay of model training. Model developers can use each flor tool separately as they see fit, or they can use flor in hands-free mode, entrusting it to instrument their code end-to-end for efficient record-replay. Our solutions leverage techniques from physiological transaction logs and recovery in database systems. Evaluations on modern ML benchmarks demonstrate that flor can produce fast checkpointing with small user-specifiable overheads (e.g., 7%) and still provide hindsight log replay times orders of magnitude faster than restarting training from scratch.
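To make the record-replay idea concrete, the following is a minimal sketch of hindsight logging over periodic checkpoints. It is not flor's API: the names `train_step`, `record`, and `replay`, the toy update rule, and the in-memory checkpoint dictionary are all hypothetical stand-ins chosen for illustration, and the sketch assumes the training step is deterministic.

```python
import copy


def train_step(state, epoch):
    """Toy deterministic update standing in for one epoch of real training (hypothetical)."""
    state["weight"] = state["weight"] * 0.9 + epoch
    return state


def record(num_epochs, checkpoint_every):
    """Run training once, snapshotting the state at the start of every k-th epoch."""
    state = {"weight": 1.0}
    checkpoints = {}
    for epoch in range(num_epochs):
        if epoch % checkpoint_every == 0:
            # Keep an independent copy so later updates do not mutate the snapshot.
            checkpoints[epoch] = copy.deepcopy(state)
        state = train_step(state, epoch)
    return state, checkpoints


def replay(checkpoints, target_epoch, log_fn):
    """Evaluate a log statement added after the fact, as of the end of target_epoch.

    Restores the nearest checkpoint at or before target_epoch and re-executes
    only the epochs between that checkpoint and the target, rather than
    restarting from epoch 0. Returns the logged value and the number of
    epochs that had to be re-executed.
    """
    start = max(e for e in checkpoints if e <= target_epoch)
    state = copy.deepcopy(checkpoints[start])
    for epoch in range(start, target_epoch + 1):
        state = train_step(state, epoch)
    return log_fn(state), target_epoch + 1 - start
```

For example, after `record(num_epochs=10, checkpoint_every=4)`, calling `replay(checkpoints, 9, lambda s: s["weight"])` restores the checkpoint taken at epoch 8 and re-executes two epochs instead of ten. The checkpoint period is the tuning knob in this sketch: more frequent checkpoints cost more during recording but shorten replay.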