In today's production machine learning (ML) systems, models are continuously trained, improved, and deployed. ML design and training are becoming a continuous workflow of diverse tasks with dynamic resource demands. Serverless computing is an emerging cloud paradigm that provides transparent resource management and scaling for users and has the potential to revolutionize the routine of ML design and training. However, hosting modern ML workflows on existing serverless platforms poses non-trivial challenges due to their intrinsic design limitations, such as their stateless nature, limited communication support across function instances, and limited function execution duration. These limitations result in a lack of an overarching view and adaptation mechanism for training dynamics, and they amplify existing problems in ML workflows. To address these challenges, we propose SMLT, an automated, scalable, and adaptive serverless framework that enables efficient and user-centric ML design and training. SMLT employs an automated and adaptive scheduling mechanism to dynamically optimize the deployment and resource scaling of ML tasks during training. SMLT further enables user-centric ML workflow execution by supporting user-specified training deadlines and budget limits. In addition, through its end-to-end design, SMLT solves the intrinsic problems of serverless platforms, such as communication overhead, limited function execution duration, and the need for repeated initialization, and also provides explicit fault tolerance for ML training. SMLT is open-sourced and compatible with all major ML frameworks. Our experimental evaluation with large, sophisticated modern ML models demonstrates that SMLT outperforms state-of-the-art VM-based systems and existing serverless ML training frameworks in both training speed (up to 8X) and monetary cost (up to 3X).