Recommendation models are very large, requiring terabytes (TB) of memory during training. In pursuit of better quality, model size and complexity grow over time, which in turn requires additional training data to avoid overfitting. This model growth demands substantial data center resources, so training efficiency is becoming increasingly important to keep data center power demand manageable. In Deep Learning Recommendation Models (DLRM), sparse features that capture categorical inputs through embedding tables are the major contributors to model size and require high memory bandwidth. In this paper, we study the bandwidth requirements and locality of embedding tables in real-world deployed models. We observe that bandwidth requirements are not uniform across tables and that embedding tables exhibit high temporal locality. We then design MTrainS, which leverages heterogeneous memory, including byte-addressable and block-addressable Storage Class Memory, hierarchically for DLRM. MTrainS provides higher memory capacity per node and increases training efficiency by reducing the need to scale out to multiple hosts in memory-capacity-bound use cases. By optimizing the platform memory hierarchy, we reduce the number of training nodes by 4-8X, saving power and cost while meeting our target training performance.
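To make the idea of bandwidth-aware table placement concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: it counts per-table embedding lookups from a hypothetical access trace, ranks tables by estimated bandwidth demand, and assigns the hottest tables to the fastest memory tier. The trace, embedding dimension, tier names, and tier capacities are all assumed placeholders.

```python
# Illustrative sketch (not MTrainS itself): place embedding tables across a
# heterogeneous memory hierarchy based on observed access frequency.
from collections import Counter

# Hypothetical lookup trace: each entry is (table_id, row_id).
trace = [(0, 5), (0, 7), (1, 3), (0, 5), (2, 9), (0, 7), (1, 3), (0, 5)]

# Bytes moved per lookup, assuming fp32 embeddings of dimension EMB_DIM.
EMB_DIM = 128
BYTES_PER_LOOKUP = EMB_DIM * 4

# Count accesses per table to estimate relative bandwidth demand.
accesses = Counter(table_id for table_id, _ in trace)

# Hypothetical tiers, fastest to slowest, with per-tier table capacities.
tiers = [("DRAM", 1), ("byte-addressable SCM", 1), ("block-addressable SCM", 10)]

# Greedily map the hottest tables to the fastest tiers.
placement = {}
ranked = iter(table for table, _ in accesses.most_common())
for tier_name, capacity in tiers:
    for _ in range(capacity):
        table = next(ranked, None)
        if table is None:
            break
        placement[table] = tier_name

for table in sorted(accesses):
    bw = accesses[table] * BYTES_PER_LOOKUP
    print(f"table {table}: {accesses[table]} lookups (~{bw} B) -> {placement[table]}")
```

In this toy example the most frequently accessed table lands in DRAM while colder tables spill to the slower Storage Class Memory tiers, which mirrors the abstract's observation that bandwidth demand is skewed across tables and can be exploited by a memory hierarchy.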