While providing low latency is a fundamental requirement in deploying recommendation services, achieving high resource utilization is also crucial for cost-effectively maintaining the datacenter. Co-locating multiple workers of a model is an effective way to maximize query-level parallelism and server throughput, but interference among concurrent workers at shared resources can prevent server queries from meeting their SLAs. Hera utilizes the heterogeneous memory requirements of multi-tenant recommendation models to intelligently determine a productive set of co-located models and their resource allocation, providing fast response times while achieving high throughput. We show that Hera achieves an average 37.3% improvement in effective machine utilization, enabling a 26% reduction in required servers and significantly improving upon the baseline recommendation inference server.