Deep learning recommendation models (DLRMs) are widely used in industry, and their memory capacity requirements reach the terabyte scale. Tiered memory architectures provide a cost-effective solution but introduce challenges in embedding-vector placement due to complex embedding-access patterns. We propose RecMG, a machine learning (ML)-guided system for vector caching and prefetching on tiered memory. RecMG accurately predicts accesses to embedding vectors with long reuse distances or few reuses. The design of RecMG focuses on making ML feasible in the context of DLRM inference by addressing unique challenges in data labeling and in navigating the search space for embedding-vector placement. By employing separate ML models for caching and prefetching, together with a novel differentiable loss function, RecMG narrows the prefetching search space and minimizes on-demand fetches. Compared to state-of-the-art temporal, spatial, and ML-based prefetchers, RecMG reduces on-demand fetches by 2.2x, 2.8x, and 1.5x, respectively. In industrial-scale DLRM inference scenarios, RecMG reduces end-to-end DLRM inference time by up to 43%.
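A central quantity in the abstract is the reuse distance of an embedding-vector access: vectors with long or infinite reuse distances are poor caching candidates. The abstract does not specify RecMG's labeling pipeline, so the following is only a minimal sketch (the function name `reuse_distances` and the flat trace format are assumptions) of how forward reuse distances could be computed from an access trace to label training data:

```python
def reuse_distances(access_trace):
    """Compute the forward reuse distance of each access in a trace.

    The reuse distance of an access is the number of subsequent accesses
    until the same embedding-vector ID is requested again; float('inf')
    marks accesses whose ID never recurs. Long or infinite distances
    identify vectors that caching cannot help with.
    """
    last_seen = {}  # vector ID -> index of its most recent access
    dists = [float('inf')] * len(access_trace)
    for i, vid in enumerate(access_trace):
        if vid in last_seen:
            dists[last_seen[vid]] = i - last_seen[vid]
        last_seen[vid] = i
    return dists

# Example: vectors 7 and 2 are reused after 3 accesses; vector 9 never is.
trace = [7, 2, 9, 7, 2]
print(reuse_distances(trace))  # [3, 3, inf, inf, inf]
```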
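The abstract also mentions, but does not define, the differentiable loss that lets the prefetching model minimize on-demand fetches. Purely as an illustration of the idea, not the paper's formulation, a differentiable surrogate could penalize the expected number of accessed-but-not-prefetched vectors under a soft capacity budget; the function `prefetch_loss`, the `budget` term, and the penalty weight below are all assumptions:

```python
import torch
import torch.nn.functional as F

def prefetch_loss(scores, accessed, budget=64):
    """Hypothetical differentiable surrogate for on-demand fetches.

    scores:   (N,) raw prefetch scores over N candidate vectors.
    accessed: (N,) 0/1 indicator of which vectors the next batch touches.
    budget:   prefetch capacity; a soft penalty keeps the expected number
              of prefetched vectors near this limit, narrowing the
              effective search space.
    """
    p = torch.sigmoid(scores)            # prefetch probabilities
    miss = (accessed * (1.0 - p)).sum()  # expected on-demand fetches
    over = F.relu(p.sum() - budget)      # soft capacity violation
    return miss + 0.1 * over

# Gradients flow from the fetch-count surrogate back into the model.
scores = torch.randn(1000, requires_grad=True)
accessed = (torch.rand(1000) < 0.05).float()
prefetch_loss(scores, accessed).backward()
```

Because both the miss term and the capacity term are smooth (or subdifferentiable) in the scores, a prefetching model can be trained end to end against them, rather than searching discretely over candidate prefetch sets.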