Deep learning-based personalized recommendation systems are widely used for online user-facing services in production datacenters, where large amounts of hardware resources are procured and managed to provide reliable, low-latency service without disruption. As recommendation models continue to evolve and grow in size, our analysis projects that datacenters built on monolithic servers will spend up to 12.4x the total cost of ownership (TCO) to keep up with model size and complexity over the next three years. Moreover, through in-depth characterization, we reveal that a monolithic server-based cluster suffers from resource idleness and wastes up to 30% of TCO because it provisions resources in fixed proportions. To address this challenge, we propose DisaggRec, a disaggregated system for large-scale recommendation serving. DisaggRec scales out compute and memory resources independently to match the changing demands of fast-evolving workloads. It also improves system reliability by isolating failures of compute nodes from failures of memory nodes. Together, these two benefits of disaggregation reduce TCO by up to 49.3%. Furthermore, disaggregation enables flexible and agile provisioning of the increasingly heterogeneous hardware expected in future datacenters. When new hardware featuring near-memory processing capability is deployed, our evaluation shows that the disaggregated cluster achieves 21% to 43.6% TCO savings over the monolithic server-based cluster across a three-year span of model evolution.
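To make the fixed-proportion argument concrete, the following is a minimal sketch of the provisioning arithmetic, not the cost model used in the paper. The `Demand` class, the function names, and all capacities and demand figures are hypothetical and chosen only for illustration. A monolithic cluster must buy whole servers until both compute and memory demand are covered, so the less constrained resource sits idle, whereas a disaggregated cluster sizes each node pool on its own.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Demand:
    """Hypothetical workload demand (units are illustrative)."""
    compute: float    # abstract compute units
    memory_gb: float  # memory capacity in GB


def monolithic_servers(d: Demand, compute_per_server: float,
                       mem_per_server: float) -> int:
    """Servers needed when compute and memory come bundled in fixed proportion.

    The cluster must satisfy the more demanding of the two resources.
    """
    return max(math.ceil(d.compute / compute_per_server),
               math.ceil(d.memory_gb / mem_per_server))


def disaggregated_nodes(d: Demand, compute_per_node: float,
                        mem_per_node: float) -> tuple[int, int]:
    """(compute nodes, memory nodes) when each pool is sized independently."""
    return (math.ceil(d.compute / compute_per_node),
            math.ceil(d.memory_gb / mem_per_node))


def idle_fraction(provisioned: float, demanded: float) -> float:
    """Fraction of a provisioned resource that the workload never uses."""
    return 1.0 - demanded / provisioned


if __name__ == "__main__":
    # Assumed memory-heavy workload: 100 compute units, 4000 GB of embeddings.
    demand = Demand(compute=100, memory_gb=4000)
    servers = monolithic_servers(demand, compute_per_server=10, mem_per_server=100)
    pools = disaggregated_nodes(demand, compute_per_node=10, mem_per_node=100)
    print("monolithic servers:", servers)
    print("idle compute fraction:", idle_fraction(servers * 10, demand.compute))
    print("disaggregated (compute, memory) nodes:", pools)
```

In this toy setting the memory demand alone dictates the monolithic server count, so most of the bundled compute goes unused, while the disaggregated layout buys only as many compute nodes as the workload needs. Real TCO depends on per-node prices, network cost, and failure handling, none of which this sketch models.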