We study the factors affecting training time in multi-device deep learning systems. Given a specification of a convolutional neural network, our goal is to minimize the time to train this model on a cluster of commodity CPUs and GPUs. We first focus on the single-node setting and show that, by using standard batching and data-parallel techniques, throughput can be improved by at least 5.5x over state-of-the-art systems on CPUs. This ensures an end-to-end training speed directly proportional to the throughput of a device regardless of its underlying hardware, allowing each node in the cluster to be treated as a black box. Our second contribution is a theoretical and empirical study of the tradeoffs affecting end-to-end training time in a multi-device setting. We identify the degree of asynchronous parallelization as a key factor affecting both hardware and statistical efficiency. We find that asynchrony can be viewed as introducing a momentum term. Our results imply that tuning momentum is critical in asynchronous parallel configurations, and suggest that published results obtained without full tuning may report suboptimal performance for some configurations. For our third contribution, we use our novel understanding of the interaction between system and optimization dynamics to build an efficient hyperparameter optimizer. Our optimizer fits a predictive model of the total time to convergence and selects the allocation of resources that minimizes this time. We demonstrate that the most popular distributed deep learning systems fall within our tradeoff space but do not optimize within it. By performing this optimization, our prototype runs 1.9x to 12x faster than the fastest state-of-the-art systems.
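To make the asynchrony-as-momentum observation concrete, the sketch below compares the expected asynchronous update to the classical momentum update. It assumes a simplified model in which M asynchronous workers apply gradients with geometrically distributed staleness; the specific value mu = 1 - 1/M is illustrative of that model rather than a general guarantee.

```latex
% Classical momentum SGD on weights w with step size \alpha and momentum \mu:
%   w_{t+1} = w_t + \mu (w_t - w_{t-1}) - \alpha \nabla f(w_t).
% Under the staleness model above, asynchronous SGD with M workers obeys the
% same recurrence in expectation, with an implicit momentum term:
\[
  \mathbb{E}[w_{t+1} - w_t]
    = \mu \, \mathbb{E}[w_t - w_{t-1}]
      - (1 - \mu)\,\alpha\, \mathbb{E}[\nabla f(w_t)],
  \qquad \mu = 1 - \frac{1}{M}.
\]
```

Under this view, any explicit momentum must be tuned jointly with the degree of asynchrony, since part of the effective momentum is already supplied by the system itself.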
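The third contribution, the hyperparameter optimizer, amounts to minimizing predicted total training time (steps to convergence times time per step) over candidate resource allocations. The following is a minimal sketch of that search; the names Config, predict_steps_to_converge, and predict_time_per_step, and the toy cost models inside them, are hypothetical placeholders rather than the system's actual interface.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Config:
    """One point in the tradeoff space: async degree and batch size."""
    num_async_groups: int  # degree of asynchronous parallelization
    batch_size: int        # data-parallel batch size within a group

def predict_steps_to_converge(cfg: Config) -> float:
    """Statistical efficiency (toy model): more asynchrony adds implicit
    momentum, which, if left untuned, can increase the steps needed."""
    implicit_momentum = 1.0 - 1.0 / cfg.num_async_groups
    return 10_000 * (1.0 + implicit_momentum)

def predict_time_per_step(cfg: Config) -> float:
    """Hardware efficiency (toy model): more parallelism lowers per-step
    compute time, with growing communication overhead."""
    compute = cfg.batch_size / (cfg.num_async_groups * 1000.0)
    overhead = 0.001 * cfg.num_async_groups
    return compute + overhead  # seconds

def best_config(async_choices, batch_choices) -> Config:
    """Select the allocation minimizing predicted total time to convergence."""
    return min(
        (Config(a, b) for a, b in product(async_choices, batch_choices)),
        key=lambda c: predict_steps_to_converge(c) * predict_time_per_step(c),
    )

if __name__ == "__main__":
    print(best_config(async_choices=[1, 2, 4, 8], batch_choices=[64, 128, 256]))
```

The key design point this sketch illustrates is that the two factors trade off against each other: increasing asynchrony improves hardware efficiency but can hurt statistical efficiency, so only their product, the predicted end-to-end time, is a meaningful objective.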