In machine learning (ML) systems research, efficient resource scheduling and utilization have long been important topics, given the compute-intensive nature of ML applications. In this paper, we introduce the design of TACC, a full-stack cloud infrastructure that efficiently manages and executes large-scale ML applications on compute clusters. TACC implements a four-layer application workflow abstraction through which system optimization techniques can be dynamically combined and applied to various types of ML applications. TACC is also tailored to the lifecycle of ML applications, providing an efficient process for managing, deploying, and scaling ML tasks. Its design simplifies the integration of the latest ML systems research into cloud infrastructures, which we hope will benefit more ML researchers and promote further ML systems research.