OptiNIC: A Resilient and Tail-Optimal RDMA NIC for Distributed ML Workloads

As distributed machine learning (ML) workloads scale to thousands of GPUs connected by high-speed interconnects, tail latency in collective communication has become a major bottleneck. Existing RDMA transports, such as RoCE, IRN, SRNIC, and Falcon, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While these approaches work well for general-purpose workloads, they introduce complexity and latency that scale poorly in ML, where even rare packet delays can stall entire model pipelines. We present OptiNIC, a domain-specific RDMA transport that revisits traditional reliability guarantees based on ML's tolerance for partial or missing data. OptiNIC eliminates retransmissions and in-order delivery from the NIC, enabling a best-effort, out-of-order transport model for RDMA. Unlike traditional RDMA, which signals completion only after all data has been delivered, OptiNIC introduces adaptive timeouts to trigger forward progress when data may be lost or delayed. OptiNIC retains standard congestion control mechanisms (e.g., DCQCN, EQDS, or Swift) while shifting loss recovery to the ML pipeline itself (e.g., via the Hadamard Transform and Erasure Coding). Our evaluation across two public clouds (Hyperstack and CloudLab) shows that OptiNIC improves time-to-accuracy (TTA) for training by 2x and increases inference throughput by 1.6x. OptiNIC also lowers 99th-percentile latency by 3.5x, cuts BRAM usage by 2.7x, and nearly doubles NIC resilience to faults, delivering a resilient, tail-optimized RDMA transport purpose-built for distributed ML workloads.
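To illustrate why a Hadamard Transform can stand in for retransmission, the sketch below encodes a toy gradient vector, drops one "packet" of encoded coordinates, and decodes what is left. It is a minimal illustration, not OptiNIC's implementation: the abstract only names the technique, so the packetization, the zero-fill of lost coordinates, and the names `fwht` and `drop_packet` are all assumptions made for this example.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse).

    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def drop_packet(coords, packet_index, packet_size):
    """Zero the coordinates carried by one lost packet.

    Hypothetical packetization: packet k carries coordinates
    [k * packet_size, (k + 1) * packet_size).
    """
    out = list(coords)
    lo = packet_index * packet_size
    for i in range(lo, min(lo + packet_size, len(out))):
        out[i] = 0.0
    return out


def max_abs_error(reference, received):
    """Largest per-coordinate deviation from the reference vector."""
    return max(abs(r - x) for r, x in zip(reference, received))


# Toy gradient with one dominant entry (assumed data, for illustration only).
gradient = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Unencoded: losing packet 0 wipes out the dominant entry entirely.
raw_received = drop_packet(gradient, 0, 2)

# Encoded: transform, lose the same packet, then apply the inverse transform.
decoded = fwht(drop_packet(fwht(gradient), 0, 2))

raw_err = max_abs_error(gradient, raw_received)
decoded_err = max_abs_error(gradient, decoded)
```

In this toy case the worst per-coordinate error falls from 8 to about 2, because the transform spreads the lost packet's contribution over many coordinates instead of erasing one of them. The total squared error still equals the energy of the lost encoded coordinates, so the benefit shown here applies to spiky vectors and is not a general guarantee.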
