Recommendation systems are of crucial importance for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, and search. To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data. Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale. In this paper, we provide insights into the intriguing and challenging inference domain of online recommendation systems. We propose the HugeCTR Hierarchical Parameter Server (HPS), an industry-leading distributed recommendation inference framework that combines a high-performance GPU embedding cache with a hierarchical storage architecture to realize low-latency retrieval of embeddings for online model inference tasks. Among other things, HPS features (1) a redundant hierarchical storage system, (2) a novel high-bandwidth cache to accelerate parallel embedding lookup on NVIDIA GPUs, (3) online training support, and (4) light-weight APIs for easy integration into existing large-scale recommendation workflows. To demonstrate its capabilities, we conduct extensive studies using both synthetically engineered and public datasets. We show that HPS can dramatically reduce end-to-end inference latency, achieving a 5x to 62x speedup (depending on the batch size) over CPU baseline implementations for popular recommendation models. Through multi-GPU concurrent deployment, HPS can also greatly increase the inference throughput (QPS).
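To make the hierarchical lookup path concrete, below is a minimal Python sketch of the tiered embedding retrieval the abstract describes: a small GPU-resident cache of hot embeddings backed by larger CPU-memory and persistent-storage tiers. The class and method names here are illustrative assumptions for exposition only, not the actual HPS API; a real deployment would additionally batch and deduplicate keys and apply an eviction policy in the GPU cache.

```python
# Conceptual sketch of three-tier embedding retrieval (GPU cache -> CPU
# memory -> persistent store). All names are hypothetical, not the HPS API.
from typing import Dict, List
import numpy as np

EMBEDDING_DIM = 16  # illustrative embedding width

class HierarchicalLookup:
    """Resolve embedding keys against progressively slower storage tiers."""

    def __init__(self,
                 gpu_cache: Dict[int, np.ndarray],
                 cpu_memory: Dict[int, np.ndarray],
                 persistent_store: Dict[int, np.ndarray]):
        self.gpu_cache = gpu_cache                # hot keys, lowest latency
        self.cpu_memory = cpu_memory              # warm keys, larger capacity
        self.persistent_store = persistent_store  # full table, e.g. SSD-backed

    def lookup(self, keys: List[int]) -> np.ndarray:
        """Gather one embedding vector per key, promoting misses upward."""
        out = np.zeros((len(keys), EMBEDDING_DIM), dtype=np.float32)
        for i, key in enumerate(keys):
            vec = self.gpu_cache.get(key)
            if vec is None:                       # GPU cache miss
                vec = self.cpu_memory.get(key)
            if vec is None:                       # CPU tier miss
                vec = self.persistent_store[key]  # authoritative full table
            self.gpu_cache[key] = vec             # promote hot key (eviction omitted)
            out[i] = vec
        return out
```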