Click-Through Rate (CTR) prediction is a crucial component of the online advertising industry. To produce a personalized CTR prediction, an industrial-scale CTR prediction model commonly takes a high-dimensional sparse vector (e.g., with 100 billion to 1 trillion features, encoded from query keywords, user portraits, etc.) as input. As a result, the model requires terabyte-scale parameters to embed the high-dimensional input. The hierarchical distributed GPU parameter server has been proposed to enable GPUs with limited memory to train such massive networks, leveraging CPU main memory and SSDs as secondary storage. We identify two major challenges in the existing GPU training framework for massive-scale ad models and propose a collection of optimizations to tackle them: (a) the GPUs, CPUs, and SSDs communicate with each other intensively during training, and the connections between GPUs and CPUs are non-uniform due to the hardware topology, so the data communication routes should be optimized accordingly; (b) GPUs in different computing nodes frequently communicate to synchronize parameters, and these communications must be optimized for the distributed system to scale. In this paper, we propose a hardware-aware training workflow that couples the hardware topology into the algorithm design. To reduce the extensive communication between computing nodes, we introduce a $k$-step model merging algorithm for the popular Adam optimizer and establish its convergence rate in non-convex optimization. To the best of our knowledge, this is the first application of a $k$-step adaptive optimization method to industrial-scale CTR model training. Numerical results on real-world data confirm that the optimized system design considerably reduces the training time of the massive model, with essentially no loss in accuracy.
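To make the $k$-step model merging idea concrete, the following is a minimal sketch (not the paper's implementation) of the general pattern: each worker runs the standard Adam update on its local data for $k$ steps, and the workers' parameters are then averaged, so inter-node synchronization happens once every $k$ steps instead of every step. The toy objective (each worker minimizing $\frac{1}{2}\|w - \text{target}_i\|^2$), the worker count, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update with bias correction."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def k_step_merged_adam(num_workers=4, k=5, rounds=40, dim=8, seed=0):
    """Toy k-step merging: k local Adam steps per worker, then average."""
    rng = np.random.default_rng(seed)
    # Each worker's local optimum stands in for its shard of training data.
    targets = rng.normal(size=(num_workers, dim))
    w = np.zeros(dim)  # shared model, synchronized every k steps
    for _ in range(rounds):
        local_models = []
        for i in range(num_workers):
            wi, m, v = w.copy(), np.zeros(dim), np.zeros(dim)
            for t in range(1, k + 1):
                g = wi - targets[i]  # gradient of 0.5 * ||w - target_i||^2
                wi, m, v = adam_step(wi, g, m, v, t)
            local_models.append(wi)
        w = np.mean(local_models, axis=0)  # model merging (one communication)
    return w, targets

w, targets = k_step_merged_adam()
# The merged model should drift toward the average of the workers' optima.
print(np.linalg.norm(w - targets.mean(axis=0)))
```

Communication here is one all-reduce-style average per round rather than per step, which is the source of the bandwidth savings; the accuracy question the paper addresses is whether this infrequent merging still converges for a non-convex objective.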