Finding the best VM configuration is key to achieving lower cost and higher throughput, two primary concerns in cloud-based distributed neural network (NN) training today. Optimal VM selection that meets user constraints requires efficiently navigating a large search space while controlling for the performance variance associated with sharing cloud instances and networks. In this work, we characterize this variance in the context of distributed NN training, and we present the results of a comprehensive throughput and cost-efficiency study, conducted across a wide array of instances, that prunes the VM search space. Using insights from these studies, we built Srifty, a system that combines runtime profiling with learned performance models to accurately predict training performance and find the best VM choice that satisfies user constraints, potentially leveraging both heterogeneous setups and spot instances. We integrated Srifty with PyTorch and evaluated it on Amazon EC2, including a large-scale generalization study across more than 2K training setups. Our results show that Srifty achieves an iteration-latency prediction error of 8%, and that its VM instance recommendations offer significant throughput gains and cost reductions over existing solutions while satisfying user constraints in complex, real-world scenarios.
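As a rough illustration of the workflow the abstract describes (profile a few setups, fit a learned performance model, then search candidate VM configurations under user constraints), the following Python sketch fits a toy linear latency model and picks the cheapest candidate that meets a throughput target. This is not Srifty's actual model or implementation; all features, profiled numbers, candidate VMs, and prices are hypothetical stand-ins.

```python
# Minimal sketch of the profile -> learn -> search workflow, assuming a toy
# linear performance model. Everything here (features, profiled latencies,
# candidate VMs, prices, the constraint) is hypothetical, not Srifty's code.
import numpy as np

# Hypothetical profiling runs: (num_gpus, network_gbps, global_batch_size).
profiled_configs = np.array([
    [1, 10, 32],
    [4, 10, 128],
    [8, 25, 256],
    [8, 100, 256],
], dtype=float)
# Measured per-iteration latency in seconds for each profiled configuration.
profiled_latency = np.array([0.188, 0.362, 0.564, 0.414])

# Fit latency ~ features @ w + b by least squares (the "learned model").
design = np.hstack([profiled_configs, np.ones((len(profiled_configs), 1))])
weights, *_ = np.linalg.lstsq(design, profiled_latency, rcond=None)

def predict_latency(num_gpus: float, network_gbps: float, batch: float) -> float:
    """Predict per-iteration latency for an unprofiled VM configuration."""
    return float(np.array([num_gpus, network_gbps, batch, 1.0]) @ weights)

# Hypothetical candidate offerings: (name, gpus, network_gbps, price $/hr).
candidates = [
    ("8x small-vm", 8, 10, 4.0),
    ("1x big-vm",   8, 100, 6.5),
    ("8x spot-vm",  8, 25, 2.2),  # spot capacity: cheaper but preemptible
]

batch = 256
min_throughput = 400.0  # user constraint, in samples/second

best = None
for name, gpus, net, price_per_hr in candidates:
    latency = predict_latency(gpus, net, batch)
    throughput = batch / latency                       # samples/second
    cost_per_sample = price_per_hr / 3600.0 / throughput
    if throughput >= min_throughput and (best is None or cost_per_sample < best[1]):
        best = (name, cost_per_sample, throughput)

print("Cheapest configuration meeting the constraint:", best)
```

In the real system, per the abstract, the predictor must also contend with the performance variance of shared cloud instances and networks, which this toy least-squares fit ignores; the overall search structure (predict performance, filter by user constraints, minimize cost) is what the sketch is meant to convey.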