In this paper, we conduct systematic measurement studies to show that the high memory bandwidth consumption of modern distributed applications can lead to a significant drop in network throughput and a large increase in tail latency in high-speed RDMA networks. We identify the root cause as severe memory bandwidth contention between application processes and network processes. This contention leads to frequent packet drops at the NIC of receiving hosts, which triggers the network's congestion control mechanism and eventually degrades network performance. To tackle this problem, we make a key observation: for a distributed storage service, the vast majority of the data it receives from the network is eventually written by the CPU to high-speed storage media (e.g., SSDs). We therefore propose to bypass host memory when processing received data, completely circumventing this performance bottleneck. In particular, we design Lamda, a novel receiver cache processing system that consumes only a small amount of CPU cache to process data received from the network at line rate. We implement a prototype of Lamda and evaluate its performance extensively on a Clos-based testbed. Results show that for distributed storage applications, Lamda improves network throughput by 4.7% with zero memory bandwidth consumption on storage nodes, and improves network throughput by up to 17% and 45% for large and small block sizes, respectively, under memory bandwidth pressure. Lamda can also be applied to latency-sensitive HPC applications, reducing their communication latency by 35.1%.