Embedding models have been an effective learning paradigm for high-dimensional data. However, one open issue of embedding models is that their representations (latent factors) often result in a large parameter space. We observe that existing distributed training frameworks face a scalability issue with embedding models, since updating and retrieving the shared embedding parameters from servers usually dominates the training cycle. In this paper, we propose HET, a new system framework that significantly improves the scalability of huge embedding model training. We embrace the skewed popularity distribution of embeddings as a performance opportunity and leverage it to address the communication bottleneck with an embedding cache. To ensure consistency across the caches, we incorporate a new consistency model into the HET design, which provides fine-grained consistency guarantees on a per-embedding basis. In contrast to previous work that only allows staleness for read operations, HET also exploits staleness for write operations. Evaluations on six representative tasks show that HET achieves up to 88% reduction in embedding communication and up to 20.68x performance speedup over state-of-the-art baselines.
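To make the per-embedding consistency idea concrete, the following is a minimal sketch (not HET's actual API) of a staleness-bounded embedding cache in which both reads and writes may lag the parameter server by a bounded number of local updates. All names here, such as ParameterServerStub, StalenessBoundedCache, and the bound s_max, are hypothetical and for illustration only.

```python
# Hypothetical sketch of a per-embedding, staleness-bounded cache.
# Reads are served from the cache while within the staleness bound;
# writes are buffered locally and pushed lazily (stale writes).
import numpy as np


class ParameterServerStub:
    """Stand-in for a remote parameter server shard holding an embedding table."""

    def __init__(self, num_rows, dim):
        self.table = np.zeros((num_rows, dim), dtype=np.float32)

    def pull(self, row_id):
        return self.table[row_id].copy()

    def push(self, row_id, delta):
        self.table[row_id] += delta


class StalenessBoundedCache:
    """Caches hot embedding rows; each row may be stale by at most
    `s_max` unsynced local updates before it must sync with the server."""

    def __init__(self, server, s_max=8):
        self.server = server
        self.s_max = s_max
        self.rows = {}    # row_id -> cached embedding vector
        self.deltas = {}  # row_id -> accumulated local updates not yet pushed
        self.clock = {}   # row_id -> number of unsynced local updates

    def read(self, row_id):
        # Stale read: reuse the cached value while within the staleness bound.
        if row_id not in self.rows or self.clock[row_id] >= self.s_max:
            self._sync(row_id)
        return self.rows[row_id]

    def write(self, row_id, grad, lr=0.01):
        # Stale write: apply the update locally and buffer it instead of
        # pushing it to the server immediately.
        update = -lr * grad
        self.rows[row_id] = self.read(row_id) + update
        self.deltas[row_id] = self.deltas.get(row_id, 0.0) + update
        self.clock[row_id] += 1
        if self.clock[row_id] >= self.s_max:
            self._sync(row_id)

    def _sync(self, row_id):
        # Push any buffered updates, then pull the fresh server value.
        delta = self.deltas.pop(row_id, None)
        if delta is not None and np.any(delta):
            self.server.push(row_id, delta)
        self.rows[row_id] = self.server.pull(row_id)
        self.clock[row_id] = 0


# Example usage: only the hot rows that are read and updated repeatedly
# stay cached, so communication concentrates on cache misses and syncs.
server = ParameterServerStub(num_rows=1000, dim=16)
cache = StalenessBoundedCache(server, s_max=4)
emb = cache.read(row_id=7)
cache.write(row_id=7, grad=np.ones(16, dtype=np.float32))
```

Because popular embeddings are read and written far more often than the staleness bound forces a sync, most of their traffic stays local, which is the intuition behind the reported communication reductions.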