Interference and Need Aware Workload Colocation in Hyperscale Datacenters

Datacenters suffer from inefficient resource utilization because service owners and platform providers have conflicting goals. Service owners, who want to maintain their Service Level Objectives (SLOs), typically request a conservative amount of resources, while platform providers want to increase operational efficiency to reduce capital and operating costs. Achieving both operational efficiency and per-service SLOs at the same time is challenging because of the diversity in service workload characteristics; resource usage patterns that depend on input load; heterogeneity in platform, memory, I/O, and network architecture; and resource bundling. This paper presents a tunable approach to resource allocation that accounts for both dynamic service resource needs and platform heterogeneity. The approach combines an online K-Means-based service classification method with an offline sensitivity component. It allows trading resource utilization efficiency for absolute SLO guarantees, based on each service owner's sensitivity to SLO violations. We evaluate our tunable resource allocator at scale in a private cloud environment with mostly latency-critical workloads. When tuning for operational efficiency, we demonstrate up to ~50% reduction in required machines, ~40% reduction in Total Cost of Ownership (TCO), and ~60% reduction in CPU and memory fragmentation, but at the cost of increasing the number of tasks experiencing SLO degradation by up to ~25% compared to the baseline. When tuning for SLO, by introducing interference-aware colocation, we can tune the solver to reduce the number of tasks experiencing SLO degradation by up to ~22% compared to the baseline, but at the cost of ~30% more hosts. We highlight this trade-off between TCO and SLO violations, and offer tuning based on the requirements of the platform owners.
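The trade-off between host count and interference described in the abstract can be illustrated with a toy example. The sketch below is not the paper's solver: it swaps in a simple greedy first-fit heuristic, and the names `place`, `Host`, `INTERFERENCE`, and `budget`, as well as the service classes and penalty values, are all hypothetical. It assumes each task already carries a service class (for instance, one produced by a clustering step) and that pairwise interference penalties between classes are known.

```python
from dataclasses import dataclass, field

# Hypothetical pairwise interference penalties between service classes.
# These numbers are made up for illustration only.
INTERFERENCE = {
    "cache": {"cache": 1.0, "web": 0.1},
    "web": {"cache": 0.1, "web": 0.2},
}


@dataclass
class Host:
    capacity: float
    used: float = 0.0
    classes: list = field(default_factory=list)


def place(tasks, capacity, interference, budget):
    """Greedy first-fit placement with an interference budget.

    tasks: list of (cpu_demand, service_class) tuples.
    budget: the tunable knob. A task joins a host only if it fits and its
    summed interference with the tasks already there is <= budget.
    """
    hosts = []
    for demand, cls in tasks:
        for host in hosts:
            fits = host.used + demand <= capacity
            penalty = sum(interference[cls][other] for other in host.classes)
            if fits and penalty <= budget:
                break  # first acceptable host wins
        else:
            # No existing host is acceptable: open a new one.
            host = Host(capacity)
            hosts.append(host)
        host.used += demand
        host.classes.append(cls)
    return hosts


tasks = [(0.5, "cache"), (0.5, "cache"), (0.5, "cache"), (0.5, "web")]

# Loose budget: pack densely, tolerate interference.
print(len(place(tasks, 1.0, INTERFERENCE, budget=10.0)))  # prints 2
# Strict budget: avoid colocating cache tasks, at the cost of more hosts.
print(len(place(tasks, 1.0, INTERFERENCE, budget=0.15)))  # prints 3
```

In this toy setting, loosening `budget` packs the same four tasks onto two hosts, while tightening it spreads them over three. This mirrors, in miniature, the direction of the trade-off the abstract reports between the number of hosts and the number of tasks exposed to interference.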
