Embedding models have been an effective learning paradigm for high-dimensional data. However, one open issue of embedding models is that their representations (latent factors) often result in a large parameter space. We observe that existing distributed training frameworks face a scalability issue with embedding models, since updating and retrieving the shared embedding parameters from servers usually dominates the training cycle. In this paper, we propose HET, a new system framework that significantly improves the scalability of huge embedding model training. We embrace the skewed popularity distribution of embeddings as a performance opportunity and leverage it to address the communication bottleneck with an embedding cache. To ensure consistency across the caches, we incorporate a new consistency model into the HET design, which provides fine-grained consistency guarantees on a per-embedding basis. In contrast to previous work that only allows staleness for read operations, HET also exploits staleness for write operations. Evaluations on six representative tasks show that HET achieves up to 88% reduction in embedding communication and up to 20.68x performance speedup over state-of-the-art baselines.
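To make the per-embedding consistency idea concrete, the following is a minimal sketch (not HET's actual API) of a staleness-bounded embedding cache in which both reads and writes may lag the parameter server by a bounded number of local updates. All names here, such as ParameterServerStub, StalenessBoundedCache, and the bound s_max, are hypothetical and for illustration only.

```python
# Hypothetical sketch of a per-embedding, staleness-bounded cache.
# Reads are served from the cache while within the staleness bound;
# writes are buffered locally and pushed lazily (stale writes).
import numpy as np


class ParameterServerStub:
    """Stand-in for a remote parameter server shard holding an embedding table."""

    def __init__(self, num_rows, dim):
        self.table = np.zeros((num_rows, dim), dtype=np.float32)

    def pull(self, row_id):
        return self.table[row_id].copy()

    def push(self, row_id, delta):
        self.table[row_id] += delta


class StalenessBoundedCache:
    """Caches hot embedding rows; each row may be stale by at most
    `s_max` unsynced local updates before it must sync with the server."""

    def __init__(self, server, s_max=8):
        self.server = server
        self.s_max = s_max
        self.rows = {}    # row_id -> cached embedding vector
        self.deltas = {}  # row_id -> accumulated local updates not yet pushed
        self.clock = {}   # row_id -> number of unsynced local updates

    def read(self, row_id):
        # Stale read: reuse the cached value while within the staleness bound.
        if row_id not in self.rows or self.clock[row_id] >= self.s_max:
            self._sync(row_id)
        return self.rows[row_id]

    def write(self, row_id, grad, lr=0.01):
        # Stale write: apply the update locally and buffer it instead of
        # pushing it to the server immediately.
        update = -lr * grad
        self.rows[row_id] = self.read(row_id) + update
        self.deltas[row_id] = self.deltas.get(row_id, 0.0) + update
        self.clock[row_id] += 1
        if self.clock[row_id] >= self.s_max:
            self._sync(row_id)

    def _sync(self, row_id):
        # Push any buffered updates, then pull the fresh server value.
        delta = self.deltas.pop(row_id, None)
        if delta is not None and np.any(delta):
            self.server.push(row_id, delta)
        self.rows[row_id] = self.server.pull(row_id)
        self.clock[row_id] = 0


# Example usage: only the hot rows that are read and updated repeatedly
# stay cached, so communication concentrates on cache misses and syncs.
server = ParameterServerStub(num_rows=1000, dim=16)
cache = StalenessBoundedCache(server, s_max=4)
emb = cache.read(row_id=7)
cache.write(row_id=7, grad=np.ones(16, dtype=np.float32))
```

Because popular embeddings are read and written far more often than the staleness bound forces a sync, most of their traffic stays local, which is the intuition behind the reported communication reductions.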