Hyperscalers run services across large fleets of servers, serving billions of users worldwide. These services, however, behave differently from the workloads in commonly available benchmark suites, so server architectures tuned to those benchmarks are not optimized for cloud workloads. With datacenters becoming a primary market for server processors, optimizing those processors for cloud workloads by better understanding their behavior has become crucial. To address this, we present MemProf, a memory profiler that profiles the three major causes of stalls in cloud workloads: code fetch, memory bandwidth, and memory latency. We use MemProf to understand the behavior of cloud workloads, and we propose and evaluate micro-architectural and memory system design improvements that improve their performance. MemProf's code analysis shows that cloud workloads execute the same code across CPU cores. Building on this observation, we propose shared micro-architectural structures: a shared L2 I-TLB and a shared L2 cache. Next, to reduce memory bandwidth stalls, we analyze the workloads' memory bandwidth distribution and find that only a few pages contribute most of the system bandwidth. We use this finding to evaluate a new high-bandwidth, small-capacity memory tier and show that it performs 1.46$\times$ better than the current baseline configuration. Finally, we look into ways to improve memory latency for cloud workloads. Profiling with MemProf reveals that L2 hardware prefetchers, a common solution for reducing memory latency, have very low coverage and consume a significant amount of memory bandwidth. To help improve hardware prefetcher performance, we built a memory tracing tool to collect and validate production memory access traces.