Sequence-based deep learning recommendation models (DLRMs) are an emerging class of DLRMs showing great improvements over their prior sum-pooling-based counterparts at capturing users' long-term interests. These improvements come at immense system cost, however, with sequence-based DLRMs requiring substantial amounts of data to be dynamically materialized and communicated by each accelerator during a single iteration. To address this rapidly growing bottleneck, we present FlexShard, a new tiered sequence embedding table sharding algorithm which operates at a per-row granularity by exploiting the insight that not every row is equal. Through precise replication of embedding rows based on their underlying probability distribution, along with the introduction of a new sharding strategy adapted to the heterogeneous, skewed performance of real-world cluster network topologies, FlexShard is able to significantly reduce communication demand while using no additional memory compared to the prior state-of-the-art. When evaluated on production-scale sequence DLRMs, FlexShard reduced overall global all-to-all communication traffic by over 85%, resulting in end-to-end training communication latency improvements of almost 6x over the prior state-of-the-art approach.
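To make the per-row idea concrete, below is a minimal illustrative sketch (not FlexShard itself) of a placement planner that replicates the hottest embedding rows on every device, so their lookups stay local, while round-robin sharding the cold tail; the function name `plan_row_placement` and the fixed replication budget are our assumptions for illustration only.

```python
import numpy as np

def plan_row_placement(access_probs, num_devices, replica_budget_rows):
    """Toy per-row sharding planner (illustrative; not the paper's algorithm).

    Rows with the highest access probability are fully replicated so their
    lookups avoid all-to-all communication; the remaining rows are
    round-robin sharded. `replica_budget_rows` caps how many rows may be
    replicated, standing in for a fixed per-device memory budget.
    """
    order = np.argsort(access_probs)[::-1]           # hottest rows first
    hot = set(order[:replica_budget_rows].tolist())  # rows to replicate

    placement = {}
    shard = 0
    for row in range(len(access_probs)):
        if row in hot:
            placement[row] = list(range(num_devices))  # full replication
        else:
            placement[row] = [shard]                   # single owner shard
            shard = (shard + 1) % num_devices
    return placement

# Example: 10 rows with a skewed (Zipf-like) access distribution, 4 devices.
probs = np.array([0.35, 0.20, 0.12, 0.09, 0.07, 0.06, 0.04, 0.03, 0.02, 0.02])
plan = plan_row_placement(probs, num_devices=4, replica_budget_rows=2)
print(plan)  # rows 0 and 1 live on all devices; the rest are sharded
```

Under a skewed distribution like the one above, replicating even a handful of rows removes a disproportionate share of remote lookups, which is the intuition behind replicating rows in proportion to their access probability.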