Parameter management is essential for distributed training of large machine learning (ML) tasks. Some ML tasks are hard to distribute because common approaches to parameter management can be highly inefficient. Advanced parameter management approaches -- such as selective replication or dynamic parameter allocation -- can improve efficiency, but to do so, they typically need to be integrated manually into each task's implementation and they require expensive upfront experimentation to tune correctly. In this work, we explore whether these two problems can be avoided. We first propose a novel intent signaling mechanism that integrates naturally into existing ML stacks and provides the parameter manager with crucial information about parameter accesses. We then describe AdaPM, a fully adaptive, zero-tuning parameter manager based on this mechanism. In contrast to prior systems, this approach separates providing information (simple, done by the task) from exploiting it effectively (hard, done automatically by AdaPM). In our experimental evaluation, AdaPM matched or outperformed state-of-the-art parameter managers out of the box, suggesting that automatic parameter management is possible.
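To make the intent-signaling idea concrete, the following is a minimal, self-contained sketch. The names `ParameterManager`, `intend`, `pull`, and `push` are hypothetical illustrations, not AdaPM's actual interface; the point is only that the task declares upcoming parameter accesses up front, and the parameter manager alone decides how to exploit that information (e.g., by replicating or relocating parameters).

```python
# Toy illustration of intent signaling (hypothetical API, not AdaPM's interface).

class ParameterManager:
    """In-process stand-in for a distributed parameter manager."""

    def __init__(self):
        self.params = {}      # key -> value (here: plain floats)
        self.intents = set()  # keys the task signaled it will access soon

    def intend(self, keys):
        # The task signals *in advance* which parameters it will access.
        # A real parameter manager could use this to replicate hot keys or
        # relocate parameters to the accessing node before the access happens.
        self.intents.update(keys)

    def pull(self, key):
        return self.params.get(key, 0.0)

    def push(self, key, delta):
        self.params[key] = self.params.get(key, 0.0) + delta


pm = ParameterManager()

# Task side: signal intent for the next mini-batch's keys, then access them.
batch_keys = [3, 17, 42]
pm.intend(batch_keys)            # cheap, task-provided information
grads = {k: -0.01 for k in batch_keys}
for k in batch_keys:
    w = pm.pull(k)               # read current value
    pm.push(k, grads[k])         # write update
```

In this sketch the task's only extra responsibility is the `intend` call; all management decisions remain inside the parameter manager, mirroring the separation of concerns described above.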