Datacenter congestion management protocols must navigate the throughput-latency-buffering trade-off in the presence of growing constraints due to switching hardware trends, oversubscribed topologies, and varying network configurability and features. In this context, receiver-driven protocols, which schedule packet transmissions instead of reacting to congestion, have shown great promise and work exceptionally well when the bottleneck lies at the ToR-to-receiver link. However, independent receiver schedules may collide if a shared link is the bottleneck instead. We present SIRD, a receiver-driven congestion control protocol designed around the simple insight that single-owner links should be explicitly scheduled while shared links should be managed through traditional congestion control algorithms. The approach achieves the best of both worlds by allowing precise control of the most common bottleneck and robust bandwidth sharing for shared bottlenecks. SIRD is implemented by end hosts and does not depend on Ethernet priorities or extensive network configuration. We compare SIRD to state-of-the-art receiver-driven protocols (Homa, dcPIM, and ExpressPass) and to production-grade reactive protocols (Swift and DCTCP) and show that SIRD is the only one that consistently maximizes link utilization, minimizes queuing, and obtains near-optimal latency across a wide set of workloads and traffic patterns. SIRD causes 12x less peak buffering than Homa and achieves competitive latency and utilization without requiring Ethernet priorities. Unlike dcPIM, SIRD operates without latency-inducing message exchange rounds and outperforms it in utilization, buffering, and tail latency by 9%, 43%, and 46%, respectively. Finally, SIRD achieves 10x lower tail latency and 26% higher utilization than ExpressPass.
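The hybrid idea stated above can be illustrated with a minimal sketch: the receiver explicitly schedules its own (single-owner) downlink by issuing credits, while senders react to congestion signals from shared links with an AIMD-style loop. All names, constants, and structure below are illustrative assumptions, not SIRD's actual implementation.

```python
# Hypothetical sketch of the receiver-driven + reactive hybrid described in
# the abstract. Not SIRD's real design; all parameters are assumed.

from collections import deque

CREDIT_BYTES = 64 * 1024            # assumed size of one credit grant
MAX_OUTSTANDING = 2 * CREDIT_BYTES  # assumed BDP-like cap on in-flight credit


class Receiver:
    """Explicitly schedules the single-owner ToR-to-receiver link."""

    def __init__(self):
        self.pending = deque()      # (sender_id, bytes_remaining) awaiting credit
        self.outstanding_bytes = 0  # credit currently in flight

    def on_request(self, sender_id, message_bytes):
        """A sender announces a message; queue it for scheduling."""
        self.pending.append((sender_id, message_bytes))

    def schedule(self):
        """Grant credit only while the downlink is not over-committed."""
        grants = []
        while self.pending and self.outstanding_bytes < MAX_OUTSTANDING:
            sender_id, remaining = self.pending.popleft()
            grant = min(CREDIT_BYTES, remaining)
            self.outstanding_bytes += grant
            grants.append((sender_id, grant))
            if remaining > grant:
                self.pending.append((sender_id, remaining - grant))
        return grants

    def on_data(self, nbytes):
        """Data arrived; the corresponding credit is no longer outstanding."""
        self.outstanding_bytes -= nbytes


class Sender:
    """Reactive AIMD-style control for shared bottlenecks (e.g., core links)."""

    def __init__(self):
        self.rate_fraction = 1.0  # fraction of granted credit actually sent per RTT

    def on_ack(self, congestion_signal):
        # An ECN mark or delay signal indicates a shared link is congested.
        if congestion_signal:
            self.rate_fraction = max(0.1, self.rate_fraction / 2)      # multiplicative decrease
        else:
            self.rate_fraction = min(1.0, self.rate_fraction + 0.05)   # additive increase
```

In this sketch, capping outstanding credit keeps the receiver's downlink busy without building deep queues, while senders throttle below their granted rate only when a shared link signals congestion, which mirrors the abstract's division between scheduled single-owner links and reactively managed shared links.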