ByteScheduler partitions and rearranges tensor transmissions to improve the communication efficiency of distributed Deep Neural Network (DNN) training. The configuration of its hyper-parameters (i.e., the partition size and the credit size) is critical to the effectiveness of partitioning and rearrangement. Currently, ByteScheduler adopts Bayesian Optimization (BO) to find the optimal hyper-parameter configuration beforehand. In practice, however, various runtime factors (e.g., worker node status and network conditions) change over time, making the statically determined, one-shot configuration suboptimal for real-world DNN training. To address this problem, we present AutoByte, a real-time configuration method that automatically and promptly searches for the optimal hyper-parameters as the training system dynamically changes. AutoByte extends the ByteScheduler framework with a meta-network, which takes the system's runtime statistics as input and predicts the speedup under specific configurations. Evaluation results on various DNN models show that AutoByte can dynamically tune the hyper-parameters with low resource usage and deliver up to 33.2\% higher performance than the best static configuration in ByteScheduler.
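To make the meta-network idea concrete, below is a minimal sketch of how such a predictor could be structured. All names, layer sizes, and the candidate-selection loop are illustrative assumptions, not the paper's actual implementation: a small network maps runtime statistics together with a candidate (partition size, credit size) pair to a predicted speedup, and the tuner picks the highest-scoring candidate at runtime.

```python
import torch
import torch.nn as nn

class MetaNetwork(nn.Module):
    """Hypothetical speedup predictor: runtime statistics plus a
    candidate (partition_size, credit_size) configuration in,
    predicted training speedup out."""

    def __init__(self, num_stats: int, hidden: int = 64):
        super().__init__()
        # Input: runtime statistics (e.g., worker and network metrics)
        # concatenated with the 2 hyper-parameters being evaluated.
        self.mlp = nn.Sequential(
            nn.Linear(num_stats + 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar speedup prediction
        )

    def forward(self, stats: torch.Tensor, config: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([stats, config], dim=-1))

def pick_best_config(net: MetaNetwork, stats: torch.Tensor, candidates):
    """Score each candidate configuration with the meta-network and
    return the one with the highest predicted speedup."""
    with torch.no_grad():
        scores = [net(stats, c).item() for c in candidates]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]

# Illustrative usage: 8 runtime statistics, 3 candidate configurations.
net = MetaNetwork(num_stats=8)
stats = torch.randn(8)
candidates = [torch.tensor([4.0, 2.0]),
              torch.tensor([8.0, 4.0]),
              torch.tensor([16.0, 8.0])]
best = pick_best_config(net, stats, candidates)
```

Because inference on such a small network is cheap, re-evaluating candidates periodically during training is consistent with the low resource usage the abstract claims, in contrast to re-running a full BO search.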