Srift: Swift and Thrift Cloud-Based Distributed Training

Cost-efficiency and training time are primary concerns in cloud-based distributed training today. With many VM configurations to choose from, which configuration achieves the lowest cost under a given time constraint? Or, given a cost budget, which configuration yields the highest throughput? We present a comprehensive throughput and cost-efficiency study across a wide array of instance choices in the cloud. Using the insights from this study, we build Srift, a system that combines runtime instrumentation and learned performance models to accurately predict training performance and find the choice of VMs that improves throughput and lowers cost while satisfying user constraints. With PyTorch and EC2, we show that Srift's choices of VM instances can lead to up to 2x better throughput and 1.6x lower cost per iteration compared to baseline choices across various DNN models in real-world scenarios, leveraging heterogeneous setups and spot instances.
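The two questions posed in the abstract can be read as a pair of constrained selection problems. The following is a minimal sketch of that framing only, not Srift's actual method: it assumes a throughput prediction is already available for each candidate, and every name (`Config`, `cheapest_within_deadline`, `fastest_within_budget`) and every number in `CANDIDATES` is hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Config:
    """A candidate cluster setup. All fields hold made-up illustrative values."""
    name: str              # hypothetical label, not a real instance type
    throughput: float      # assumed predicted training iterations per second
    price_per_hour: float  # assumed USD per hour for the whole cluster


def training_hours(cfg: Config, iterations: int) -> float:
    """Wall-clock hours needed to run the given number of iterations."""
    return iterations / cfg.throughput / 3600.0


def total_cost(cfg: Config, iterations: int) -> float:
    """Total USD cost of running the given number of iterations."""
    return training_hours(cfg, iterations) * cfg.price_per_hour


def cheapest_within_deadline(
    configs: Iterable[Config], iterations: int, max_hours: float
) -> Optional[Config]:
    """Given a time constraint, return the lowest-cost feasible config."""
    feasible = [c for c in configs if training_hours(c, iterations) <= max_hours]
    return min(feasible, key=lambda c: total_cost(c, iterations), default=None)


def fastest_within_budget(
    configs: Iterable[Config], iterations: int, max_cost: float
) -> Optional[Config]:
    """Given a cost budget, return the highest-throughput feasible config."""
    feasible = [c for c in configs if total_cost(c, iterations) <= max_cost]
    return max(feasible, key=lambda c: c.throughput, default=None)


# Hypothetical candidates; the numbers are not measurements from the paper.
CANDIDATES = [
    Config("small-x4", throughput=2.0, price_per_hour=4.0),
    Config("large-x2", throughput=5.0, price_per_hour=12.0),
    Config("mixed-x3", throughput=4.0, price_per_hour=7.0),
]

if __name__ == "__main__":
    job_iterations = 72_000
    print(cheapest_within_deadline(CANDIDATES, job_iterations, max_hours=6.0))
    print(fastest_within_budget(CANDIDATES, job_iterations, max_cost=45.0))
```

In this toy setup the fastest candidate is not the cheapest one that meets a 6-hour deadline, which is the kind of trade-off the abstract describes. A real system would replace the fixed `throughput` field with predictions from a performance model.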
