The appeal of serverless (FaaS) has triggered a growing interest in how to use it in data-intensive applications such as ETL, query processing, or machine learning (ML). Several systems exist for training large-scale ML models on top of serverless infrastructures (e.g., AWS Lambda), but with inconclusive results in terms of their performance and their relative advantage over "serverful" infrastructures (IaaS). In this paper we present a systematic, comparative study of distributed ML training over FaaS and IaaS. We present a design space covering design choices such as optimization algorithms and synchronization protocols, and implement a platform, LambdaML, that enables a fair comparison between FaaS and IaaS. We present experimental results using LambdaML, and further develop an analytic model to capture the cost/performance tradeoffs that must be considered when opting for a serverless infrastructure. Our results indicate that ML training pays off in serverless only for models with reduced communication needs that converge quickly. In general, FaaS can be much faster, but it is never significantly cheaper than IaaS.