Deep learning recommendation systems serve personalized content under diverse tail-latency targets and input-query loads. To do so, state-of-the-art recommendation models rely on terabyte-scale embedding tables to learn user preferences over large bodies of content. Relying on a fixed representation for these embedding tables not only imposes significant memory capacity and bandwidth requirements but also limits the scope of compatible system solutions. This paper challenges the assumption of fixed embedding representations by showing how synergies between embedding representations and hardware platforms can improve both algorithmic and system performance. Based on our characterization of various embedding representations, we propose a hybrid embedding representation that achieves higher-quality embeddings at the cost of increased memory and compute requirements. To address the system performance challenges of the hybrid representation, we propose MP-Rec, a co-design technique that exploits heterogeneity and dynamic selection of embedding representations and underlying hardware platforms. On real system hardware, we demonstrate how matching custom accelerators, i.e., GPUs, TPUs, and IPUs, with compatible embedding representations can yield a 16.65x performance speedup. Additionally, in query-serving scenarios, MP-Rec achieves 2.49x and 3.76x higher correct prediction throughput and 0.19% and 0.22% better model quality on a CPU-GPU system for the Kaggle and Terabyte datasets, respectively.