Deep learning-based recommendation systems form the backbone of most personalized cloud services. Though the computer architecture community has recently begun to take notice of deep recommendation inference, the resulting solutions have taken wildly different approaches, ranging from near-memory processing to at-scale optimizations. To better design future hardware systems for deep recommendation inference, we must first systematically examine and characterize the underlying systems-level impact of design decisions across the different levels of the execution stack. In this paper, we characterize eight industry-representative deep recommendation models at three different levels of the execution stack: algorithms and software, systems platforms, and hardware microarchitectures. Through this cross-stack characterization, we first show that system deployment choices (i.e., CPUs or GPUs, batch-size granularity) can yield up to a 15x speedup. To identify bottlenecks for further optimization, we examine both the software operator usage breakdown and CPU frontend and backend microarchitectural inefficiencies. Finally, we model the correlation between key algorithmic model architecture features and hardware bottlenecks, revealing the absence of a single dominant algorithmic component behind each hardware bottleneck.