Most existing datacenter transport protocols rely on in-order packet delivery, a design choice rooted in legacy systems and simplicity. However, advances such as RDMA have made it feasible to relax this requirement, allowing more effective use of modern datacenter topologies like FatTree and Dragonfly. The rise of AI/ML workloads underscores the need for higher link utilization, which single-path load balancers struggle to deliver because of issues like ECMP collisions. In this paper, we introduce REPS, a novel per-packet traffic load-balancing algorithm that integrates with existing congestion control mechanisms. REPS reroutes packets around congested hotspots and unreliable or failing links while remaining simple and requiring minimal state. Our evaluation shows that REPS significantly outperforms traditional packet spraying and other state-of-the-art solutions in datacenter networks, improving both performance and link utilization.