Hindsight Logging for Model Training

In modern machine learning, model training is an iterative, experimental process that can consume enormous computational resources and developer time. To aid in that process, experienced model developers log and visualize program variables during training runs. Exhaustive logging of all variables is infeasible. Optimistic logging can be accompanied by program checkpoints; this allows developers to add log statements post-hoc and "replay" the desired log statements from a checkpoint, a process we refer to as hindsight logging. Unfortunately, hindsight logging raises tricky problems in data management and software engineering. Done poorly, hindsight logging can waste resources and generate technical debt embodied in multiple variants of training code. In this paper, we present methodologies for efficient and effective logging practices for model training, with a focus on techniques for hindsight logging. Our goal is for experienced model developers to learn and adopt these practices. To make this easier, we provide an open-source suite of tools for Fast Low-Overhead Recovery (flor) that embodies our design across three tasks: (i) efficient background logging in Python, (ii) adaptable periodic checkpointing, and (iii) an instrumentation library that codifies hindsight logging for efficient and automatic record-replay of model training. Model developers can use each flor tool separately as they see fit, or they can use flor in hands-free mode, entrusting it to instrument their code end-to-end for efficient record-replay. Our solutions leverage techniques from physiological transaction logs and recovery in database systems. Evaluations on modern ML benchmarks demonstrate that flor can produce fast checkpointing with small user-specifiable overheads (e.g., 7%), and still provide hindsight log replay times orders of magnitude faster than restarting training from scratch.
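The record-replay idea in the abstract can be illustrated with a minimal sketch. This is not flor's API: the names `train_epoch`, `record`, and `replay`, the toy quadratic objective, and the in-memory checkpoint dictionary are all assumptions made for illustration. The original run logs only the loss but checkpoints state at the start of every epoch; a later hindsight run restores one checkpoint and re-executes just that epoch with a log statement that did not exist when training first ran.

```python
import copy
import random


def train_epoch(state, epoch, log=None):
    """Run one toy training epoch, mutating `state` in place (hypothetical helper)."""
    # A per-epoch seed keeps the replayed epoch identical to the recorded one.
    rng = random.Random(state["seed"] + epoch)
    for step in range(5):
        # Noisy gradient of the toy objective 0.5 * (w - 3)^2.
        grad = state["w"] - 3.0 + rng.uniform(-0.1, 0.1)
        state["w"] -= state["lr"] * grad
        if log is not None:
            # Hindsight log statement: absent during the original run.
            log(epoch, step, grad)


def record(num_epochs):
    """Original run: log only the loss, checkpoint state at each epoch start."""
    state = {"w": 0.0, "lr": 0.5, "seed": 42}
    checkpoints, losses = {}, []
    for epoch in range(num_epochs):
        checkpoints[epoch] = copy.deepcopy(state)
        train_epoch(state, epoch)
        losses.append(0.5 * (state["w"] - 3.0) ** 2)
    return state, checkpoints, losses


def replay(checkpoints, epoch):
    """Hindsight run: restore one checkpoint and re-execute only that epoch."""
    state = copy.deepcopy(checkpoints[epoch])
    grads = []
    train_epoch(state, epoch, log=lambda e, s, g: grads.append(g))
    return state, grads


final_state, checkpoints, losses = record(num_epochs=4)
replayed_state, grads = replay(checkpoints, epoch=3)
```

In this sketch, replaying epoch 3 touches one epoch instead of four, which is the source of the speedup over restarting from scratch. A real training job would serialize model and optimizer state to disk rather than deep-copying a dictionary, and would need to control every source of nondeterminism, not just one seed.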

