Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to reduce serving latency in the presence of bursty workloads. We explore the new trade-off space and present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster. Evaluation results on production workloads show that AlpaServe can process requests at up to 10x higher rates or 6x more burstiness while staying within latency constraints for more than 99% of requests.
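As a rough illustration of the trade-off described above, the toy simulation below (a minimal sketch, not AlpaServe's actual placement or parallelization algorithm) compares two ways of serving K models on K GPUs under bursty traffic: a dedicated placement that pins each model to its own device, and a model-parallel placement that shards every model across all devices and pays an assumed per-request overhead. All names and constants (K, SERVICE, ALPHA, BURST_LEN, GAP) are illustrative assumptions, not measurements from the paper.

# Toy simulation (not AlpaServe itself): compares mean request latency for
# (a) dedicated placement -- each of K models pinned to its own GPU -- versus
# (b) a model-parallel placement that shards every model across all K GPUs,
# paying a per-request overhead ALPHA but letting any burst use every device.
# All parameters below are illustrative assumptions.
import random

random.seed(0)

K = 4              # number of GPUs == number of models
SERVICE = 1.0      # seconds to serve one request on one dedicated GPU
ALPHA = 0.10       # assumed model-parallelism overhead (10% extra work)
BURST_LEN = 20     # requests per burst, all aimed at a single "hot" model
N_BURSTS = 200
GAP = 30.0         # idle seconds between bursts (traffic is bursty, not uniform)

def gen_requests():
    """Bursty trace: each burst picks one hot model and sends BURST_LEN
    back-to-back requests to it, then the cluster goes idle for GAP seconds."""
    t, reqs = 0.0, []
    for _ in range(N_BURSTS):
        model = random.randrange(K)
        for _ in range(BURST_LEN):
            reqs.append((t, model))
            t += 0.05          # requests arrive 50 ms apart inside a burst
        t += GAP
    return reqs

def simulate(reqs, placement):
    """FIFO, non-preemptive. 'dedicated': a request must run on its model's GPU.
    'parallel': every request runs across all K GPUs with service time
    SERVICE * (1 + ALPHA) / K (intra-op-style speedup minus overhead)."""
    free_at = [0.0] * K        # time each dedicated GPU becomes free
    group_free_at = 0.0        # time the shared K-GPU group becomes free
    latencies = []
    for arrival, model in reqs:
        if placement == "dedicated":
            start = max(arrival, free_at[model])
            finish = start + SERVICE
            free_at[model] = finish
        else:
            start = max(arrival, group_free_at)
            finish = start + SERVICE * (1 + ALPHA) / K
            group_free_at = finish
        latencies.append(finish - arrival)
    return sum(latencies) / len(latencies)

reqs = gen_requests()
print("mean latency, dedicated      :", round(simulate(reqs, "dedicated"), 2), "s")
print("mean latency, model-parallel :", round(simulate(reqs, "parallel"), 2), "s")

With these assumed numbers, the dedicated placement leaves three of the four GPUs idle during each burst, so queueing delay on the hot model's GPU dominates latency; the sharded placement absorbs the same burst with all four devices despite the 10% overhead. This is the statistical-multiplexing effect the abstract refers to, and AlpaServe's contribution is to search this placement and parallelization trade-off space systematically rather than with fixed toy parameters.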