Memory efficiency is crucial when training deep learning networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Instead of keeping all of these dependencies in memory until they are reused in backpropagation, some forward tensors can be discarded and recomputed later from saved tensors, so-called checkpoints. In particular, this allows resource-constrained heterogeneous environments to make use of all available compute devices. Unfortunately, defining these checkpoints is a non-trivial problem and poses a challenge to the programmer: improper or excessive recomputations negate the benefit of checkpointing. In this article, we present XEngine, an approach that schedules network operators to heterogeneous devices in low-memory environments by determining checkpoints and recomputations of tensors. Our approach selects suitable resources per timestep and operator and optimizes the end-to-end time of neural networks while taking the memory limitation of each device into account. For this, we formulate a mixed-integer quadratic program (MIQP) to schedule operators of deep learning networks on heterogeneous systems. We compare our MIQP solver XEngine against Checkmate, a mixed-integer linear programming (MILP) approach that solves recomputation on a single device. Our solver finds solutions that are up to 22.5% faster than the fastest Checkmate schedule in which the network is computed exclusively on a single device. We also find valid schedules for networks that make use of both central processing units and graphics processing units when memory limitations do not allow scheduling exclusively to the graphics processing unit.
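The trade-off between memory and recomputation described above can be illustrated with a minimal sketch. It is not XEngine's MIQP formulation and not Checkmate's MILP; it assumes a toy linear chain of identical scalar operators on a single device, with checkpoints placed at a fixed, hand-picked interval `k`. The operator `f`, its derivative `df`, and both function names are hypothetical and chosen only for illustration.

```python
import math

def f(x):
    # One toy "operator" in the chain (assumption, not a real network layer).
    return math.sin(x) + 0.5 * x

def df(x):
    # Derivative of the operator with respect to its input.
    return math.cos(x) + 0.5

def grad_store_all(x0, n):
    """Keep every forward tensor: n stored inputs, n forward evaluations."""
    inputs = []
    x = x0
    for _ in range(n):
        inputs.append(x)
        x = f(x)
    grad = 1.0
    for x_i in reversed(inputs):  # chain rule, last operator first
        grad *= df(x_i)
    return grad, {"forward_evals": n, "peak_stored": len(inputs)}

def grad_checkpointed(x0, n, k):
    """Keep only every k-th input; recompute the rest segment by segment."""
    checkpoints = {}
    x = x0
    evals = 0
    for i in range(n):
        if i % k == 0:
            checkpoints[i] = x  # saved tensor (checkpoint)
        x = f(x)
        evals += 1
    grad = 1.0
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + k, n)
        segment = [checkpoints[start]]
        for _ in range(start + 1, end):
            segment.append(f(segment[-1]))  # recomputation
            evals += 1
        # All checkpoints stay resident; the segment adds its recomputed inputs.
        peak = max(peak, len(checkpoints) + len(segment) - 1)
        for x_i in reversed(segment):
            grad *= df(x_i)
    return grad, {"forward_evals": evals, "peak_stored": peak}

if __name__ == "__main__":
    print(grad_store_all(0.3, 16))
    print(grad_checkpointed(0.3, 16, 4))
```

For a chain of 16 operators with `k = 4`, this sketch holds at most 7 values instead of 16 while performing 28 forward evaluations instead of 16, and both functions return the same gradient. Choosing which tensors to save, and on which device to run each operator, is the part that a solver-based approach would decide rather than a fixed interval.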