Cost-efficiency and training time are primary concerns in cloud-based distributed training today. With many VM configurations to choose from, which configuration achieves the lowest cost under a given time constraint? Or, given a cost budget, which configuration yields the highest throughput? We present a comprehensive throughput and cost-efficiency study across a wide array of instance choices in the cloud. With the insights from this study, we build Srift, a system that combines runtime instrumentation and learned performance models to accurately predict training performance and find the best choice of VMs to improve throughput and lower cost while satisfying user constraints. Using PyTorch on EC2, we show that Srift's choices of VM instances can lead to up to 2x better throughput and 1.6x lower cost per iteration compared to baseline choices across various DNN models in real-world scenarios, by leveraging heterogeneous setups and spot instances.
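The two questions posed above can be read as constrained selection problems over candidate configurations. The following is a minimal sketch of that reading, not Srift's actual method: it assumes a throughput prediction is already available for each candidate, and the `Candidate` class, the function names, and all configuration names, throughputs, and prices are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """A hypothetical VM configuration with an assumed predicted throughput."""
    name: str
    iters_per_sec: float  # predicted throughput (assumed to come from some model)
    usd_per_hour: float   # combined hourly price of all VMs in the configuration

    def hours_for(self, iterations: int) -> float:
        # Wall-clock hours needed to run the requested number of iterations.
        return iterations / self.iters_per_sec / 3600.0

    def cost_for(self, iterations: int) -> float:
        # Total price of running the requested number of iterations.
        return self.hours_for(iterations) * self.usd_per_hour


def cheapest_within_deadline(
    cands: Iterable[Candidate], iterations: int, max_hours: float
) -> Optional[Candidate]:
    """Lowest-cost candidate that finishes within max_hours, or None."""
    feasible = [c for c in cands if c.hours_for(iterations) <= max_hours]
    return min(feasible, key=lambda c: c.cost_for(iterations), default=None)


def fastest_within_budget(
    cands: Iterable[Candidate], iterations: int, max_usd: float
) -> Optional[Candidate]:
    """Highest-throughput candidate whose total cost fits max_usd, or None."""
    feasible = [c for c in cands if c.cost_for(iterations) <= max_usd]
    return max(feasible, key=lambda c: c.iters_per_sec, default=None)


# Illustrative, made-up numbers; not measurements from the study.
CANDIDATES = [
    Candidate("small", iters_per_sec=2.0, usd_per_hour=1.0),
    Candidate("medium", iters_per_sec=5.0, usd_per_hour=3.0),
    Candidate("large", iters_per_sec=8.0, usd_per_hour=6.0),
]

if __name__ == "__main__":
    job = 36_000  # iterations to train
    print(cheapest_within_deadline(CANDIDATES, job, max_hours=3.0))
    print(fastest_within_budget(CANDIDATES, job, max_usd=6.5))
```

With these made-up numbers, the fastest configuration is not the cheapest one that meets a 3-hour deadline, and the cheapest configuration overall misses that deadline. The choice therefore depends on the constraint, and its quality depends on how accurate the throughput predictions are.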