ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table

Because of the superior feature representation ability of deep learning, industrial companies deploy various deep Click-Through Rate (CTR) models in their commercial systems. To achieve better performance, deep CTR models must be trained efficiently on a huge volume of training data, which makes speeding up the training process an essential problem. Unlike models with dense training data, the training data for CTR models is usually high-dimensional and sparse. To transform the high-dimensional sparse input into low-dimensional dense real-valued vectors, almost all deep CTR models adopt an embedding layer, which easily reaches hundreds of GB or even TB in size. Since a single GPU cannot accommodate all the embedding parameters, it is not reasonable to rely on data-parallelism alone when performing distributed training. Therefore, existing distributed training platforms for recommendation adopt model-parallelism. Specifically, they use the CPU (Host) memory of servers to maintain and update the embedding parameters and use GPU workers to conduct the forward and backward computations. Unfortunately, these platforms suffer from two bottlenecks: (1) the latency of pull & push operations between Host and GPU; (2) parameter update and synchronization in the CPU servers. To address these bottlenecks, in this paper we propose ScaleFreeCTR (SFCTR), a MixCache-based distributed training system for CTR models. Specifically, SFCTR also stores the huge embedding table in CPU memory but uses GPUs instead of CPUs to conduct embedding synchronization efficiently. To reduce the latency of both GPU-Host and GPU-GPU data transfer, the MixCache mechanism and the Virtual Sparse Id operation are proposed. Comprehensive experiments and ablation studies are conducted to demonstrate the effectiveness and efficiency of SFCTR.
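
The host-resident table and the pull & push traffic described above can be illustrated with a small sketch. This is not the paper's MixCache: the abstract does not describe its internals, so the code below substitutes a plain LRU cache as a stand-in for GPU memory, and every name in it (`HostEmbeddingTable`, `DeviceCache`, `pull`, `push`, `lookup`) is hypothetical. It only shows why caching frequently used embedding rows and deduplicating the IDs in a batch reduces the number of rows pulled from the host.

```python
import random
from collections import OrderedDict


class HostEmbeddingTable:
    """Full embedding table kept in host (CPU) memory; rows are created lazily."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rows = {}
        self.rng = random.Random(seed)
        self.pulls = 0  # number of rows transferred host -> device

    def pull(self, ids):
        """Return copies of the requested rows, counting each as one transfer."""
        self.pulls += len(ids)
        out = {}
        for i in ids:
            if i not in self.rows:
                self.rows[i] = [self.rng.uniform(-0.01, 0.01) for _ in range(self.dim)]
            out[i] = list(self.rows[i])
        return out

    def push(self, updates):
        """Write rows back to the host table."""
        for i, vec in updates.items():
            self.rows[i] = list(vec)


class DeviceCache:
    """Illustrative LRU cache standing in for GPU memory (an assumption, not MixCache)."""

    def __init__(self, host, capacity):
        self.host = host
        self.capacity = capacity
        self.cache = OrderedDict()

    def lookup(self, batch_ids):
        # Deduplicate the batch so each distinct ID is fetched at most once.
        unique = list(dict.fromkeys(batch_ids))
        if len(unique) > self.capacity:
            raise ValueError("batch has more distinct IDs than the cache can hold")
        # Mark hits as most recently used so they are not evicted below.
        for i in unique:
            if i in self.cache:
                self.cache.move_to_end(i)
        missing = [i for i in unique if i not in self.cache]
        if missing:
            for i, vec in self.host.pull(missing).items():
                self.cache[i] = vec
                # Evict least recently used rows, pushing them back to the host.
                while len(self.cache) > self.capacity:
                    old_id, old_vec = self.cache.popitem(last=False)
                    self.host.push({old_id: old_vec})
        return [self.cache[i] for i in batch_ids]


host = HostEmbeddingTable(dim=4)
cache = DeviceCache(host, capacity=3)
first = cache.lookup([7, 7, 9])   # two distinct IDs, so two rows are pulled
second = cache.lookup([7, 9])     # both rows are cached, so nothing is pulled
```

In this sketch the repeated ID 7 in the first batch costs a single pull, and the second batch is served entirely from the cache. A real system would also have to apply gradient updates and synchronize rows across GPUs, which the sketch leaves out.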
