Neural networks in ads systems usually take input from multiple sources, e.g., query-ad relevance, ad features, and user portraits. These inputs are encoded as one-hot or multi-hot binary features, with typically only a tiny fraction of nonzero feature values per example. Deep learning models in the online advertising industry can have terabyte-scale parameters that fit in neither the GPU memory nor the CPU main memory of a single computing node. For example, a sponsored online advertising system can contain more than $10^{11}$ sparse features, making the neural network a massive model with roughly 10 TB of parameters. In this paper, we introduce a distributed GPU hierarchical parameter server for massive-scale deep learning ads systems. We propose a hierarchical workflow that utilizes GPU high-bandwidth memory, CPU main memory, and SSD as a 3-layer hierarchical storage. All neural network training computations are contained in GPUs. Extensive experiments on real-world data confirm the effectiveness and scalability of the proposed system. A 4-node hierarchical GPU parameter server can train a model more than 2X faster than a 150-node in-memory distributed parameter server in an MPI cluster. In addition, the price-performance ratio of our proposed system is 4-9 times better than the MPI-cluster solution.
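The 3-layer storage hierarchy described above can be pictured as a cache hierarchy over the sparse parameters: hot embeddings live in GPU HBM, warmer ones in CPU main memory, and the full 10 TB parameter set on SSD. The following is a minimal, hypothetical sketch of such a tiered lookup; the class name, FIFO eviction policy, and in-memory stand-in for SSD storage are illustrative assumptions, not the paper's actual implementation.

```python
class HierarchicalStore:
    """Toy 3-tier parameter store: GPU HBM -> CPU DRAM -> SSD (sketch)."""

    def __init__(self, hbm_capacity, dram_capacity):
        self.hbm = {}    # hottest parameters cached in GPU HBM
        self.dram = {}   # warmer parameters in CPU main memory
        self.ssd = {}    # full parameter set; a dict stands in for SSD files
        self.hbm_capacity = hbm_capacity
        self.dram_capacity = dram_capacity

    def put(self, key, value):
        # New or updated parameters land in the bottom tier;
        # stale cached copies are invalidated.
        self.ssd[key] = value
        self.hbm.pop(key, None)
        self.dram.pop(key, None)

    def _promote(self, cache, capacity, key, value):
        # Simple FIFO eviction stands in for a real cache policy.
        if len(cache) >= capacity:
            cache.pop(next(iter(cache)))
        cache[key] = value

    def get(self, key):
        if key in self.hbm:        # fastest tier: GPU HBM
            return self.hbm[key]
        if key in self.dram:       # middle tier: CPU memory
            value = self.dram[key]
        else:                      # slowest tier: SSD
            value = self.ssd[key]
            self._promote(self.dram, self.dram_capacity, key, value)
        self._promote(self.hbm, self.hbm_capacity, key, value)
        return value
```

Because only a tiny fraction of the $10^{11}$ sparse features occur in any mini-batch, the working set touched per iteration is small enough for the upper tiers to absorb most lookups, which is what makes keeping all training computation on the GPUs feasible.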