Recommender systems are enablers of personalized content delivery, and therefore revenue, for many large companies. In the last decade, deep learning recommender models (DLRMs) have become the de-facto standard in this field. The main bottleneck in DLRM inference is the lookup of sparse features across huge embedding tables, which are usually partitioned across the aggregate RAM of many nodes. In state-of-the-art recommender systems, this distributed lookup is implemented via irregular all-to-all (alltoallv) communication. Today, most related work treats this operation as a given; moreover, the collective itself is synchronous in nature. In this work, we propose a novel bounded lag synchronous (BLS) version of the alltoallv operation. The bound is a parameter that allows slower processes to lag behind by entire iterations before the fastest processes block. In special applications such as inference-only DLRM, the accuracy of the application is fully preserved. We implement BLS alltoallv in a new PyTorch Distributed backend and evaluate it with a BLS version of the reference DLRM code. We show that for well-balanced, homogeneous-access DLRM runs our BLS technique offers no notable advantages. For unbalanced runs, however, e.g. runs with strongly irregular embedding table accesses or with delays across different processes, our BLS technique improves both the latency and throughput of inference-only DLRM. In the best case, the proposed reduced synchronization can mask the delays across processes altogether.
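To make the bounded-lag semantics concrete, the following is a minimal Python sketch, not the paper's actual backend: each rank posts its alltoallv asynchronously and only blocks on the exchange issued `lag_bound` iterations earlier, so fast ranks may run ahead of slow ones by up to the bound. The class name `BLSAllToAll`, the parameter `lag_bound`, and the use of `torch.distributed.all_to_all` with `async_op=True` are illustrative assumptions; the real implementation lives in a custom PyTorch Distributed backend.

```python
# Minimal sketch of bounded-lag-synchronous (BLS) alltoallv semantics.
# Assumes torch.distributed is already initialized; names are illustrative.
import collections
import torch.distributed as dist

class BLSAllToAll:
    """Ranks may run up to `lag_bound` iterations ahead before blocking
    on the oldest outstanding exchange; lag_bound=0 is the usual
    fully synchronous alltoallv."""

    def __init__(self, lag_bound):
        self.lag_bound = lag_bound
        self.pending = collections.deque()  # outstanding (request, buffers)

    def exchange(self, output_lists, input_lists):
        # Post this iteration's variable-size all-to-all without blocking.
        req = dist.all_to_all(output_lists, input_lists, async_op=True)
        self.pending.append((req, output_lists))
        # Only the exchange issued `lag_bound` iterations ago must finish
        # now, so slower ranks can lag by up to that many iterations.
        if len(self.pending) > self.lag_bound:
            old_req, old_out = self.pending.popleft()
            old_req.wait()
            return old_out  # results of iteration i - lag_bound
        return None  # still within the allowed lag window
```

In this sketch, inference-only accuracy is unaffected because every exchange still completes with the same data; only the point at which a rank blocks is deferred.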