Embedding tables dominate the size of industrial-scale recommendation models, consuming up to terabytes of memory. A popular benchmark, and the largest publicly available MLPerf machine learning benchmark on recommendation data, is the Deep Learning Recommendation Model (DLRM) trained on a terabyte of click-through data; it contains 100GB of embedding memory (25+ billion parameters). Due to their sheer size and the associated volume of data, DLRMs are difficult to train and deploy for inference, and their large embedding tables create memory bottlenecks. This paper analyzes and extensively evaluates a generic parameter sharing setup (PSS) for compressing DLRM models. We show theoretical upper bounds on the learnable memory required to achieve a $(1 \pm \epsilon)$ approximation of the embedding table; our bounds indicate that exponentially fewer parameters suffice for good accuracy. Empirically, we demonstrate a PSS DLRM reaching 10000$\times$ compression on criteo-tb without losing quality. This compression, however, comes with a caveat: it requires 4.5$\times$ more iterations to reach the same saturation quality. We argue that this tradeoff deserves further investigation, as it may be significantly favorable. Leveraging the small size of the compressed model, we show a 4.3$\times$ improvement in training latency, leading to similar overall training times. Thus, in the tradeoff between the system advantages of a small DLRM model and its slower convergence, we show that the scales are tipped towards the smaller DLRM model, yielding faster inference, easier deployment, and similar training times.
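To make the idea concrete, one common way to instantiate such a parameter sharing setup is hash-based sharing; the notation below is an illustrative sketch under that assumption, not necessarily the paper's exact construction. Instead of storing a full embedding table $E \in \mathbb{R}^{n \times d}$, one keeps a single shared parameter vector $w \in \mathbb{R}^{m}$ with $m \ll nd$ and reconstructs each entry on the fly,
\[
E[i][j] \;=\; g(i,j)\, w\big[h(i,j)\big], \qquad w \in \mathbb{R}^{m},\; m \ll nd,
\]
where $h:[n]\times[d]\to[m]$ and $g:[n]\times[d]\to\{-1,+1\}$ are fixed (non-learned) random hash functions and only $w$ is trained. The learnable memory is then $m$, independent of the nominal table size $n \times d$, which is the sense in which a 10000$\times$ compression of the embedding parameters is possible.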